Integrate Context-Aware Chunking and PDF Support #284
base: main
Conversation
khaledsulayman commented Sep 23, 2024
- ✨ Add context-aware document chunking and table processing
- add docling parser
force-pushed from 5ca1f89 to 43d9414
I'm keeping track of the work here, but withholding any feedback until you get to a point where you think it's ready. If you need any help tracking down CI or other test failures while working on this, let me know and I'll make time.
force-pushed from 10cda56 to 1146df3
force-pushed from 2e224b6 to 521d5e0
force-pushed from a4209d6 to bec5daf
force-pushed from d81db63 to 2471ba9
- Implemented `build_chunks_from_docling_json` for handling mixed document elements.
- Added `fuse_texts` to merge short texts with longer ones.
- Integrated heading formatting and table generation from JSON and tokenizer-based chunking.

Signed-off-by: Aakanksha Duggal <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
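The commit message above mentions a `fuse_texts` helper for merging short texts into longer ones. A minimal sketch of what such a step might look like (the length threshold and merge direction here are assumptions for illustration, not the PR's actual implementation):

```python
def fuse_texts(texts, short_len=32):
    """Merge any chunk shorter than `short_len` characters into the
    preceding chunk, so tiny fragments never stand alone."""
    fused = []
    for text in texts:
        if fused and len(text) < short_len:
            fused[-1] = fused[-1] + " " + text
        else:
            fused.append(text)
    return fused
```

A run over `["A long enough paragraph about chunking strategies.", "Short."]` would fold the trailing fragment into the paragraph before it.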
force-pushed from 2471ba9 to 82fa5c2
I did an initial once-over - may not have caught everything, and some of the things I caught I think will be obvious once we can run tests (including end-to-end CI tests) on this.
We'll want to add a pdf file and data generation from it to our existing end-to-end tests to exercise that code-path, as there's a lot of new logic here that we'll want to ensure works in the CI environment.
I tried to focus mostly on the issues that will either cause packaging headaches (keeping our dependencies as slim as possible), user-impacting issues running this across multiple systems (file path handling), or breakage to things like existing markdown functionality. We can clean up other bits later once we get clean end-to-end tests working for markdown and PDFs.
We'll also need to rename `tests/test_chunking.py` to `tests/test_docprocessor.py` and update its tests to pass the new expected values to the `chunk_documents` method.
@@ -10,5 +10,8 @@ openai>=1.13.3,<2.0.0
# removed once that one is removed.
# do not use 8.4.0 due to a bug in the library
# https://github.com/instructlab/instructlab/issues/1389
pypdf>=5.0.0
Do we actually need pypdf here? We're using it to read the contents of the PDF files, but then we never actually use those contents downstream. If we really want to read the contents of the PDFs, we could use something like https://github.com/DS4SD/docling-parse instead, as that's already a dependency of docling anyway, so it won't bring in anything extra.
tenacity>=8.3.0,!=8.4.0
transformers>=4.44.2
Should we align this with instructlab/instructlab? Currently that's on 4.41.2 at https://github.com/instructlab/instructlab/blob/main/requirements.txt#L34
We're missing an entry for `docling` here to get docling itself added as a requirement. And, since docling depends on torch, we'll want to also add a dependency on torch to pin its version to the same range as instructlab/instructlab - https://github.com/instructlab/instructlab/blob/main/requirements.txt#L31 (currently `torch>=2.3.0,<2.4.0`).
Here's a suggested change to requirements.txt that adds docling and aligns a couple of dependencies with the CLI repo:
diff --git a/requirements.txt b/requirements.txt
index 195dddf..84bb75c 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,17 +1,20 @@
# SPDX-License-Identifier: Apache-2.0
click>=8.1.7,<9.0.0
datasets>=2.18.0,<3.0.0
+docling>=1.15.0,<2.0.0
GitPython>=3.1.42,<4.0.0
httpx>=0.25.0,<1.0.0
instructlab-schema>=0.4.0
langchain-text-splitters
openai>=1.13.3,<2.0.0
+pypdf>=5.0.0
+tabulate>=0.9.0
# Note: this dependency goes along with langchain-text-splitters and may be
# removed once that one is removed.
# do not use 8.4.0 due to a bug in the library
# https://github.com/instructlab/instructlab/issues/1389
-pypdf>=5.0.0
-tabulate>=0.9.0
tenacity>=8.3.0,!=8.4.0
-transformers>=4.44.2
+# align torch with instructlab/instructlab
+torch>=2.3.0,<2.4.0
+transformers>=4.41.2
xdg-base-dirs>=6.0.1
That would be enough to at least get some of these tests running, although until test_chunking.py is adjusted it's going to just error on every run.
# Local

logger = logging.getLogger(__name__)
DOC_FILEPATH = Path("~/.local/share/instructlab/documents").expanduser()
This path should probably be somewhere under the `output_dir` from generate_data.py. Since we'll create new docs here for every run, perhaps under the `node_datasets_*` subdirectory for that run?
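A hedged sketch of that suggestion (the `node_datasets_*` name comes from the comment above; the exact directory layout and the `run_id` parameter are assumptions for illustration):

```python
from pathlib import Path

def docs_dir_for_run(output_dir: str, run_id: str) -> Path:
    # Keep per-run parsed documents under generate_data's output_dir
    # instead of a fixed ~/.local/share/instructlab path, e.g.
    # <output_dir>/node_datasets_<run_id>/documents
    return Path(output_dir) / f"node_datasets_{run_id}" / "documents"
```

Each run then gets its own documents directory, and nothing leaks into the user's home directory.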
def __init__(
    self,
    parsed_doc_dir: Path,
    tokenizer_model_name: str = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",
This should come from the `model_name` parameter passed to generate_data.py's `generate_data`. That way we load the tokenizer for the same teacher model that's being used.
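A minimal sketch of the wiring being suggested (the class body and the `make_processor` helper are hypothetical; only `parsed_doc_dir` and `tokenizer_model_name` appear in the quoted code):

```python
class DocProcessor:
    def __init__(self, parsed_doc_dir, tokenizer_model_name):
        # No hard-coded default model: the caller must supply the teacher
        # model name so the tokenizer matches the model generating data,
        # e.g. AutoTokenizer.from_pretrained(tokenizer_model_name).
        self.parsed_doc_dir = parsed_doc_dir
        self.tokenizer_model_name = tokenizer_model_name

def make_processor(parsed_doc_dir, model_name):
    # generate_data would pass its own model_name parameter straight through
    return DocProcessor(parsed_doc_dir, tokenizer_model_name=model_name)
```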
parsed_pdfs = [converter.convert_single(d) for d in pdf_docs]
parsed_dicts = [p.render_as_dict() for p in parsed_pdfs]

docling_jsons_path = DOC_FILEPATH / "docling-jsons"
Prefer `os.path.join` over string concatenation for combining paths - `os.path.join` handles various edge cases, like ensuring we don't add double slashes, and copes with more complex paths.
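A quick illustration of the difference (the paths here are made up):

```python
import os

base = "data/docs/"   # note the trailing slash
name = "report.json"

# Naive concatenation can produce a double slash:
print(base + "/" + name)          # data/docs//report.json

# os.path.join normalizes separator handling:
print(os.path.join(base, name))   # data/docs/report.json
```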
file_contents.append(pdf_text)

if file_contents:
    filepaths = [DOC_FILEPATH / p for p in file_patterns]
Prefer `os.path.join` over string concatenation for combining paths - `os.path.join` handles various edge cases, like ensuring we don't add double slashes, and copes with more complex paths.
datasets = []
for json_fp in self.docling_jsons:
    chunk_ds = self._process_parsed_docling_json(json_fp)
    chunk_ds_with_icls = self._add_icls(chunk_ds)
Where do we add the icl values for non-PDF files? This codepath only happens for PDFs, right?
Ok, digging into this a bit more, we shouldn't be doing anything different for icl values with PDFs, right? I'm confused why we do anything with icl_query_* here at all, as that all gets handled in taxonomy.py already. What's the intent behind the separate duplicate icl handling for pdfs?
force-pushed from c79b096 to 63df047
force-pushed from 0004d43 to 9c4649c
force-pushed from 946c71e to 734245c
force-pushed from 734245c to f1419a0
force-pushed from f1419a0 to 0ffc735
force-pushed from 0ffc735 to 1dcc77f
force-pushed from 1dcc77f to c05a4cc
force-pushed from a41da8f to a3b6c15
force-pushed from a3b6c15 to a66f694
chunkers now return lists of chunks

Signed-off-by: Khaled Sulayman <[email protected]>
force-pushed from a66f694 to 5b8b393