
Integrate Context-Aware Chunking and PDF Support #284

Draft · khaledsulayman wants to merge 6 commits into main from ks-integrate-docprocessor

Conversation

khaledsulayman (Member):

  • ✨ Add context-aware document chunking and table processing
  • add docling parser

@mergify mergify bot added ci-failure and removed ci-failure labels Sep 23, 2024
@khaledsulayman force-pushed the ks-integrate-docprocessor branch 2 times, most recently from 5ca1f89 to 43d9414 on September 24, 2024 at 15:11
@mergify mergify bot added ci-failure and removed ci-failure labels Sep 24, 2024
@bbrowning (Contributor):

I'm keeping track of the work here, but withholding any feedback until you get to a point where you think it's ready. If you need any help tracking down CI or other test failures while working on this, let me know and I'll make time.

@mergify mergify bot added the dependencies Pull requests that update a dependency file label Sep 25, 2024
@khaledsulayman force-pushed the ks-integrate-docprocessor branch 3 times, most recently from a4209d6 to bec5daf on September 26, 2024 at 20:25
@mergify mergify bot added ci-failure and removed ci-failure labels Sep 26, 2024
@khaledsulayman force-pushed the ks-integrate-docprocessor branch 3 times, most recently from d81db63 to 2471ba9 on September 27, 2024 at 15:33
- Implemented `build_chunks_from_docling_json` to handle mixed document elements.
- Added `fuse_texts` to merge short texts into longer neighboring ones (sketched below).
- Integrated heading formatting, table generation from JSON, and tokenizer-based chunking.

Signed-off-by: Aakanksha Duggal <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
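
For illustration, a minimal sketch of the short-text fusion idea named above (hypothetical: the function name matches the commit, but the threshold and merging rule are assumptions, not the actual docprocessor.py implementation):

```python
def fuse_texts(texts: list[str], short_length: int = 130) -> list[str]:
    """Merge fragments shorter than short_length into the preceding text."""
    fused: list[str] = []
    for text in texts:
        if fused and len(text) < short_length:
            # Short fragment: append it to the previous, longer chunk
            fused[-1] = f"{fused[-1]} {text}"
        else:
            fused.append(text)
    return fused
```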
@bbrowning (Contributor) left a comment:

I did an initial once-over; I may not have caught everything, and some of the things I caught should become obvious once we can run tests (including end-to-end CI tests) on this.

We'll want to add a PDF file, and data generation from it, to our existing end-to-end tests to exercise that code path, as there's a lot of new logic here that we'll want to ensure works in the CI environment.

I tried to focus mostly on issues that will cause packaging headaches (keeping our dependencies as slim as possible), user-impacting problems running this across multiple systems (file path handling), or breakage of existing functionality like markdown processing. We can clean up other bits later once we get clean end-to-end tests working for markdown and PDFs.

Contributor:

We'll also need to rename tests/test_chunking.py to tests/test_docprocessor.py and update its tests to pass the new expected values to the chunk_documents method.
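
For concreteness, a rough sketch of what a renamed test might look like (entirely hypothetical: the chunk_documents name comes from the comment above, but its signature and expected values are assumptions until the PR settles):

```python
# tests/test_docprocessor.py (hypothetical sketch)
from instructlab.sdg.utils import docprocessor

def test_chunk_documents_returns_chunks():
    documents = ["# Heading\n\nSome markdown body text to be chunked."]
    # Signature assumed for illustration only
    chunks = docprocessor.chunk_documents(
        documents, server_ctx_size=4096, chunk_word_count=1000
    )
    assert isinstance(chunks, list)
    assert all(isinstance(c, str) for c in chunks)
```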

@@ -10,5 +10,8 @@ openai>=1.13.3,<2.0.0
# removed once that one is removed.
# do not use 8.4.0 due to a bug in the library
# https://github.com/instructlab/instructlab/issues/1389
pypdf>=5.0.0
Contributor:

Do we actually need pypdf here? We're using it to read the contents of the PDF files, but we never actually use those contents downstream. If we really want to read the PDFs, we could use something like https://github.com/DS4SD/docling-parse instead; it's already a dependency of docling anyway, so it wouldn't bring in anything extra.
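
For reference, the pypdf usage pattern under discussion presumably looks something like this (a sketch, not the exact code in the PR):

```python
from pypdf import PdfReader

# Read all page text from a PDF. If this extracted text is never consumed
# downstream, the pypdf dependency could be dropped entirely.
reader = PdfReader("document.pdf")
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)
```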

tenacity>=8.3.0,!=8.4.0
transformers>=4.44.2
Contributor:

Should we align this with instructlab/instructlab? Currently that's on 4.41.2 at https://github.com/instructlab/instructlab/blob/main/requirements.txt#L34

Contributor:

We're missing an entry here for docling itself as a requirement. And since docling depends on torch, we'll want to add an explicit torch dependency pinned to the same range as instructlab/instructlab - https://github.com/instructlab/instructlab/blob/main/requirements.txt#L31 (currently torch>=2.3.0,<2.4.0).

Contributor:

Here's a suggested change to requirements.txt that adds docling and aligns a couple of dependencies with the CLI repo:

diff --git a/requirements.txt b/requirements.txt
index 195dddf..84bb75c 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,17 +1,20 @@
 # SPDX-License-Identifier: Apache-2.0
 click>=8.1.7,<9.0.0
 datasets>=2.18.0,<3.0.0
+docling>=1.15.0,<2.0.0
 GitPython>=3.1.42,<4.0.0
 httpx>=0.25.0,<1.0.0
 instructlab-schema>=0.4.0
 langchain-text-splitters
 openai>=1.13.3,<2.0.0
+pypdf>=5.0.0
+tabulate>=0.9.0
 # Note: this dependency goes along with langchain-text-splitters and may be
 #       removed once that one is removed.
 # do not use 8.4.0 due to a bug in the library
 # https://github.com/instructlab/instructlab/issues/1389
-pypdf>=5.0.0
-tabulate>=0.9.0
 tenacity>=8.3.0,!=8.4.0
-transformers>=4.44.2
+# align torch with instructlab/instructlab
+torch>=2.3.0,<2.4.0
+transformers>=4.41.2
 xdg-base-dirs>=6.0.1

That would be enough to at least get some of these tests running, although until test_chunking.py is adjusted it's going to just error on every run.

# Local

logger = logging.getLogger(__name__)
DOC_FILEPATH = Path("~/.local/share/instructlab/documents").expanduser()
Contributor:

This path should probably be somewhere under the output_dir from generate_data.py. Since we'll create new docs here for every run, perhaps under the node_datasets_* subdirectory for that run?
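A sketch of one way to do that (hypothetical: the function and variable names here are illustrative, not part of the PR):

```python
from pathlib import Path

# Hypothetical: derive the documents directory from generate_data's
# output_dir instead of a hard-coded ~/.local/share path.
def docs_dir_for_run(output_dir: str, run_suffix: str) -> Path:
    return Path(output_dir) / f"node_datasets_{run_suffix}" / "documents"
```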

def __init__(
    self,
    parsed_doc_dir: Path,
    tokenizer_model_name: str = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",
Contributor:

This should come from the model_name parameter passed to generate_data.py's generate_data. That way we load the tokenizer for the same teacher model that's being used.
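Something along these lines (a sketch; plumbing the parameter through generate_data is the actual work):

```python
from transformers import AutoTokenizer

# Hypothetical: load the tokenizer for the same teacher model that
# generate_data(model_name=...) is using, rather than a hard-coded default.
def load_tokenizer(model_name: str):
    return AutoTokenizer.from_pretrained(model_name)
```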

parsed_pdfs = [converter.convert_single(d) for d in pdf_docs]
parsed_dicts = [p.render_as_dict() for p in parsed_pdfs]

docling_jsons_path = DOC_FILEPATH / "docling-jsons"
Contributor:

Prefer os.path.join over string concatenation when combining paths; os.path.join handles edge cases like avoiding double slashes and composing more complex paths.
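
To illustrate the point (reusing the DOC_FILEPATH constant from the earlier snippet):

```python
import os
from pathlib import Path

DOC_FILEPATH = Path("~/.local/share/instructlab/documents").expanduser()

# Fragile: manual concatenation can drop or double separators
jsons_path = str(DOC_FILEPATH) + "/" + "docling-jsons"

# Safer: os.path.join normalizes separators across platforms
jsons_path = os.path.join(DOC_FILEPATH, "docling-jsons")
```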

file_contents.append(pdf_text)

if file_contents:
    filepaths = [DOC_FILEPATH / p for p in file_patterns]
Contributor:

Prefer os.path.join over string concatenation when combining paths; os.path.join handles edge cases like avoiding double slashes and composing more complex paths.

src/instructlab/sdg/utils/docprocessor.py (outdated comment, resolved)
datasets = []
for json_fp in self.docling_jsons:
    chunk_ds = self._process_parsed_docling_json(json_fp)
    chunk_ds_with_icls = self._add_icls(chunk_ds)
Contributor:

Where do we add the icl values for non-PDF files? This code path only runs for PDFs, right?

Contributor:

OK, digging into this a bit more: we shouldn't be doing anything different for icl values with PDFs, right? I'm confused why we do anything with icl_query_* here at all, since that's all handled in taxonomy.py already. What's the intent behind the separate, duplicate icl handling for PDFs?

@khaledsulayman force-pushed the ks-integrate-docprocessor branch 8 times, most recently from c79b096 to 63df047 on September 27, 2024 at 21:18
@mergify mergify bot added ci-failure and removed ci-failure labels Oct 17, 2024
@khaledsulayman force-pushed the ks-integrate-docprocessor branch 3 times, most recently from 946c71e to 734245c on October 18, 2024 at 23:37
@mergify mergify bot added ci-failure and removed ci-failure labels Oct 18, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Oct 18, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Oct 19, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Oct 19, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Oct 19, 2024
@khaledsulayman force-pushed the ks-integrate-docprocessor branch 3 times, most recently from a41da8f to a3b6c15 on October 19, 2024 at 01:57
@mergify mergify bot removed the ci-failure label Oct 19, 2024
@mergify mergify bot added the ci-failure label Oct 19, 2024
chunkers now return lists of chunks

Signed-off-by: Khaled Sulayman <[email protected]>
Labels: ci-failure, dependencies (pull requests that update a dependency file)
Projects: none yet
Development: successfully merging this pull request may close the issue "context-aware chunking"
4 participants