Integrate Context-Aware Chunking and PDF Support #284
base: main
Conversation
khaledsulayman commented Sep 23, 2024
- ✨ Add context-aware document chunking and table processing
- add docling parser
force-pushed from 5ca1f89 to 43d9414
I'm keeping track of the work here, but withholding any feedback until you get to a point where you think it's ready. If you need any help tracking down CI or other test failures while working on this, let me know and I'll make time.
force-pushed from 10cda56 to 1146df3
force-pushed from 2e224b6 to 521d5e0
force-pushed from a4209d6 to bec5daf
force-pushed from d81db63 to 2471ba9
- Implemented `build_chunks_from_docling_json` for handling mixed document elements.
- Added `fuse_texts` to merge short texts with longer ones.
- Integrated heading formatting and table generation from JSON and tokenizer-based chunking.

Signed-off-by: Aakanksha Duggal <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
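The commit message above mentions a `fuse_texts` helper for merging short texts into longer ones. A minimal sketch of what such a step might look like (the length threshold and merge direction here are assumptions for illustration, not the PR's actual implementation):

```python
def fuse_texts(texts, short_len=32):
    """Merge any chunk shorter than `short_len` characters into the
    preceding chunk, so tiny fragments never stand alone."""
    fused = []
    for text in texts:
        if fused and len(text) < short_len:
            fused[-1] = fused[-1] + " " + text
        else:
            fused.append(text)
    return fused
```

A run over `["A long enough paragraph about chunking strategies.", "Short."]` would fold the trailing fragment into the paragraph before it.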
force-pushed from 2471ba9 to 82fa5c2
I did an initial once-over - may not have caught everything, and some of the things I caught I think will be obvious once we can run tests (including end-to-end CI tests) on this.
We'll want to add a pdf file and data generation from it to our existing end-to-end tests to exercise that code-path, as there's a lot of new logic here that we'll want to ensure works in the CI environment.
I tried to focus mostly on the issues that will either cause packaging headaches (keeping our dependencies as slim as possible), user-impacting issues running this across multiple systems (file path handling), or breakage to things like existing markdown functionality. We can clean up other bits later once we get clean end-to-end tests working for markdown and PDFs.
We'll also need to rename `tests/test_chunking.py` to `tests/test_docprocessor.py` and update its tests to pass the new expected values to the `chunk_documents` method.
@@ -10,5 +10,8 @@ openai>=1.13.3,<2.0.0
# removed once that one is removed.
# do not use 8.4.0 due to a bug in the library
# https://github.com/instructlab/instructlab/issues/1389
pypdf>=5.0.0
Do we actually need pypdf here? We're using it to read the contents of the PDF files, but then we never actually use those contents downstream. If we really want to read the contents of the PDFs, we could use something like https://github.com/DS4SD/docling-parse instead, as that's already a dependency of docling anyway, so it won't bring in anything extra.
tenacity>=8.3.0,!=8.4.0
transformers>=4.44.2
Should we align this with instructlab/instructlab? Currently that's on 4.41.2 at https://github.com/instructlab/instructlab/blob/main/requirements.txt#L34
We're missing an entry for `docling` here to get docling itself added as a requirement. And, since docling depends on torch, we'll want to also add a dependency on torch to pin its version to the same range as instructlab/instructlab - https://github.com/instructlab/instructlab/blob/main/requirements.txt#L31 (currently `torch>=2.3.0,<2.4.0`).
Here's a suggested change to requirements.txt that adds docling and aligns a couple of dependencies with the CLI repo:
diff --git a/requirements.txt b/requirements.txt
index 195dddf..84bb75c 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,17 +1,20 @@
# SPDX-License-Identifier: Apache-2.0
click>=8.1.7,<9.0.0
datasets>=2.18.0,<3.0.0
+docling>=1.15.0,<2.0.0
GitPython>=3.1.42,<4.0.0
httpx>=0.25.0,<1.0.0
instructlab-schema>=0.4.0
langchain-text-splitters
openai>=1.13.3,<2.0.0
+pypdf>=5.0.0
+tabulate>=0.9.0
# Note: this dependency goes along with langchain-text-splitters and may be
# removed once that one is removed.
# do not use 8.4.0 due to a bug in the library
# https://github.com/instructlab/instructlab/issues/1389
-pypdf>=5.0.0
-tabulate>=0.9.0
tenacity>=8.3.0,!=8.4.0
-transformers>=4.44.2
+# align torch with instructlab/instructlab
+torch>=2.3.0,<2.4.0
+transformers>=4.41.2
xdg-base-dirs>=6.0.1
That would be enough to at least get some of these tests running, although until test_chunking.py is adjusted it's going to just error on every run.
# Local

logger = logging.getLogger(__name__)
DOC_FILEPATH = Path("~/.local/share/instructlab/documents").expanduser()
This path should probably be somewhere under the `output_dir` from generate_data.py. Since we'll create new docs here for every run, perhaps under the `node_datasets_*` subdirectory for that run?
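A hedged sketch of that suggestion (the `node_datasets_*` name comes from the comment above; the exact directory layout and the `run_id` parameter are assumptions for illustration):

```python
from pathlib import Path

def docs_dir_for_run(output_dir: str, run_id: str) -> Path:
    # Keep per-run parsed documents under generate_data's output_dir
    # instead of a fixed ~/.local/share/instructlab path, e.g.
    # <output_dir>/node_datasets_<run_id>/documents
    return Path(output_dir) / f"node_datasets_{run_id}" / "documents"
```

Each run then gets its own documents directory, and nothing leaks into the user's home directory.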
def __init__(
    self,
    parsed_doc_dir: Path,
    tokenizer_model_name: str = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",
This should come from the `model_name` parameter passed to generate_data.py's `generate_data`. That way we load the tokenizer for the same teacher model that's being used.
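A minimal sketch of the wiring being suggested (the class body and the `make_processor` helper are hypothetical; only `parsed_doc_dir` and `tokenizer_model_name` appear in the quoted code):

```python
class DocProcessor:
    def __init__(self, parsed_doc_dir, tokenizer_model_name):
        # No hard-coded default model: the caller must supply the teacher
        # model name so the tokenizer matches the model generating data,
        # e.g. AutoTokenizer.from_pretrained(tokenizer_model_name).
        self.parsed_doc_dir = parsed_doc_dir
        self.tokenizer_model_name = tokenizer_model_name

def make_processor(parsed_doc_dir, model_name):
    # generate_data would pass its own model_name parameter straight through
    return DocProcessor(parsed_doc_dir, tokenizer_model_name=model_name)
```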
parsed_pdfs = [converter.convert_single(d) for d in pdf_docs]
parsed_dicts = [p.render_as_dict() for p in parsed_pdfs]

docling_jsons_path = DOC_FILEPATH / "docling-jsons"
Prefer `os.path.join` over string concatenation for combining paths - `os.path.join` handles various edge cases, like ensuring we don't add double slashes, and copes with more complex paths.
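A quick illustration of the difference (the paths here are made up):

```python
import os

base = "data/docs/"   # note the trailing slash
name = "report.json"

# Naive concatenation can produce a double slash:
print(base + "/" + name)          # data/docs//report.json

# os.path.join normalizes separator handling:
print(os.path.join(base, name))   # data/docs/report.json
```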
file_contents.append(pdf_text)

if file_contents:
    filepaths = [DOC_FILEPATH / p for p in file_patterns]
Prefer `os.path.join` over string concatenation for combining paths - `os.path.join` handles various edge cases, like ensuring we don't add double slashes, and copes with more complex paths.
datasets = []
for json_fp in self.docling_jsons:
    chunk_ds = self._process_parsed_docling_json(json_fp)
    chunk_ds_with_icls = self._add_icls(chunk_ds)
Where do we add the icl values for non-PDF files? This codepath only happens for PDFs, right?
Ok, digging into this a bit more, we shouldn't be doing anything different for icl values with PDFs, right? I'm confused why we do anything with icl_query_* here at all, as that all gets handled in taxonomy.py already. What's the intent behind the separate duplicate icl handling for pdfs?
force-pushed from c79b096 to 63df047
force-pushed from 0004d43 to 9c4649c
force-pushed from 946c71e to 734245c
force-pushed from 734245c to f1419a0
force-pushed from f1419a0 to 0ffc735
force-pushed from 0ffc735 to 1dcc77f
force-pushed from 1dcc77f to c05a4cc
force-pushed from a41da8f to a3b6c15
force-pushed from a3b6c15 to a66f694
chunkers now return lists of chunks

Signed-off-by: Khaled Sulayman <[email protected]>
force-pushed from a66f694 to 5b8b393