-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: add megaparse sdk #3419
fix: add megaparse sdk #3419
Conversation
core/quivr_core/processor/implementations/megaparse_processor.py
Outdated
Show resolved
Hide resolved
core/quivr_core/processor/implementations/megaparse_processor.py
Outdated
Show resolved
Hide resolved
} | ||
return docs | ||
return [document] | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably fail here. The tasks would be retried if we raise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am raising an Exception now, tell me if that is what you meant :)
if response.status_code == 200: | ||
result = response.json().get("result") | ||
document = Document(page_content=result) | ||
if len(document.page_content) > self.splitter_config.chunk_size: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if :
if len(document.page_content) > self.splitter_config.chunk_size: | |
if len(document.page_content) , self.splitter_config.chunk_size: |
I think the splitter still work with 1 chunk size. This is preferable because we would have the chunk_size in metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now i am just checking that the document is not empty, tell me if that is what you meant
Fixes : #3390 Note : Modified Examples position from ./examples to ./core/examples @AmineDiro The examples works back again :) |
core/quivr_core/processor/implementations/megaparse_processor.py
Outdated
Show resolved
Hide resolved
core/quivr_core/processor/implementations/megaparse_processor.py
Outdated
Show resolved
Hide resolved
) | ||
|
||
if response.status_code == 200: | ||
result = response.json().get("result") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be None. We should deserialize into structs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean
|
||
megaparse = pytest.importorskip("megaparse") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This effectively skips megaparse. Do we need to run tests here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It takes too long to test it all, note that the file types are tested in megaparse
switch to #3462 |
What it does :
To test :
-> Go to the Megaparse repo and launch the api using this PR : QuivrHQ/MegaParse#93 (it will be merge soon)
-> Test the Megaparse file parsing
To Fix:
-> Imports errors for Brain in examples/pdf_document_from_yaml
While using these modification Megaparse is subject to a lot of change, don't forget to pull from main before launching the Megaparse API each time !
@jacopo-chevallard