fix: add megaparse sdk #3419

Closed
wants to merge 8 commits into from

@chloedia (Collaborator) commented Oct 22, 2024

What it does:

  • Adds the MegaParse API call for parsing using the SDK

To test:
-> Go to the MegaParse repo and launch the API using this PR: QuivrHQ/MegaParse#93 (it will be merged soon)
-> Test the MegaParse file parsing

To fix:
-> Import errors for Brain in examples/pdf_document_from_yaml

MegaParse is subject to a lot of change while these modifications are in use, so don't forget to pull from main each time before launching the MegaParse API!

@jacopo-chevallard

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Oct 22, 2024
@chloedia chloedia requested a review from AmineDiro October 22, 2024 21:22
}
return docs
return [document]
else:
Collaborator

We should probably fail here. The task would be retried if we raise an exception.

Collaborator Author

I am raising an exception now; tell me if that is what you meant :)
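
For readers following the thread, here is a minimal sketch of the failure path being discussed, assuming a response object that exposes `status_code` and `json()` as in the diff context above. `ParsingError`, `FakeResponse`, and `extract_result` are hypothetical names used only for illustration; they are not taken from this PR.

```python
from dataclasses import dataclass
from typing import Optional


class ParsingError(Exception):
    """Hypothetical error type signalling a failed parsing request."""


@dataclass
class FakeResponse:
    """Stand-in for an HTTP response object (illustration only)."""

    status_code: int
    payload: dict

    def json(self) -> dict:
        return self.payload


def extract_result(response) -> Optional[str]:
    # Raise instead of silently returning a fallback value, so the caller
    # (for example a task runner with retries enabled) sees the failure.
    if response.status_code != 200:
        raise ParsingError(
            f"parsing request failed with status {response.status_code}"
        )
    return response.json().get("result")
```

Whether a raised exception actually triggers a retry depends on how the surrounding task is configured, which this thread does not show.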

if response.status_code == 200:
result = response.json().get("result")
document = Document(page_content=result)
if len(document.page_content) > self.splitter_config.chunk_size:
Collaborator

if :

Suggested change
if len(document.page_content) > self.splitter_config.chunk_size:
if len(document.page_content) , self.splitter_config.chunk_size:

I think the splitter still works when the document fits in a single chunk. This is preferable because we would have the chunk_size in the metadata.

Collaborator Author

Now I am just checking that the document is not empty; tell me if that is what you meant.
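
A rough sketch of the behaviour described in this exchange: reject an empty document, and otherwise always run the splitter so that even a single-chunk result carries `chunk_size` in its metadata. The `Document` class and the fixed-width `split_document` function below are simplified, hypothetical stand-ins, not the splitter used in the PR.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Minimal stand-in for a document with text and metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def split_document(document: Document, chunk_size: int) -> list[Document]:
    # Reject empty output rather than returning an empty chunk.
    if not document.page_content:
        raise ValueError("parsed document is empty")
    text = document.page_content
    # Always go through the splitter, even when the text fits in one chunk,
    # so every returned chunk records chunk_size in its metadata.
    return [
        Document(
            page_content=text[start : start + chunk_size],
            metadata={**document.metadata, "chunk_size": chunk_size},
        )
        for start in range(0, len(text), chunk_size)
    ]
```

With this shape, a short document yields a list with exactly one chunk instead of bypassing the splitter.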

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Oct 29, 2024
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Oct 29, 2024
@chloedia
Collaborator Author

Fixes: #3390

Note: Moved the examples from ./examples to ./core/examples @AmineDiro

The examples work again :)

core/pyproject.toml (outdated, resolved)
)

if response.status_code == 200:
result = response.json().get("result")
Collaborator

This could be None. We should deserialize into structs.

Collaborator Author

Not sure what you mean
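
One possible reading of the suggestion above, sketched with a standard-library dataclass: instead of reading `result` with `.get()` (which returns `None` when the key is missing), validate the JSON body once at the boundary and hand a typed object to the rest of the code. `ParseResponse` and `from_json` are hypothetical names, and only the `result` field is taken from the diff; the real response may contain other fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseResponse:
    """Hypothetical typed view of the JSON body returned by the parsing API."""

    result: str

    @classmethod
    def from_json(cls, payload: dict) -> "ParseResponse":
        result = payload.get("result")
        # Fail at the boundary if the field is missing, null, or not a string,
        # so later code never has to handle a None result.
        if not isinstance(result, str):
            raise ValueError("response has no string 'result' field")
        return cls(result=result)
```

Usage would then look like `ParseResponse.from_json(response.json()).result`, with a malformed body raising immediately rather than producing a `Document` whose content is `None`.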

core/quivr_core/config.py (outdated, resolved)

megaparse = pytest.importorskip("megaparse")
Collaborator

This effectively skips megaparse. Do we need to run tests here?

Collaborator Author

It takes too long to test everything. Note that the file types are tested in MegaParse.

core/tests/processor/odt/test_odt.py (outdated, resolved)
@chloedia chloedia changed the title fix: add megaparse api fix: add megaparse sdk Nov 7, 2024
@chloedia (Collaborator, Author) commented Nov 7, 2024

switch to #3462

@chloedia chloedia closed this Nov 7, 2024