fix: add megaparse sdk #3419

Closed

chloedia wants to merge 8 commits into main from add/megaparse_api

Collaborator

chloedia commented Oct 22, 2024

What it does:

Adds the MegaParse API call for parsing, using the SDK.

To test:
-> Go to the MegaParse repo and launch the API using this PR: QuivrHQ/MegaParse#93 (it will be merged soon)
-> Test the MegaParse file parsing

To fix:
-> Import errors for Brain in examples/pdf_document_from_yaml

MegaParse is subject to a lot of change while these modifications are in use, so don't forget to pull from main each time before launching the MegaParse API!

@jacopo-chevallard


          fix: add megaparse api

080f6be

dosubot bot added the size:M label

chloedia requested a review from AmineDiro

October 22, 2024 21:22

AmineDiro requested changes

core/quivr_core/processor/implementations/megaparse_processor.py

+                                  }
+                              return docs
+                          return [document]
+                      else:

Collaborator

AmineDiro Oct 24, 2024

We should probably fail here. The tasks would be retried if we raise

Collaborator Author

chloedia Oct 29, 2024

I am raising an Exception now, tell me if that is what you meant :)
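
A minimal sketch of the behaviour discussed in this thread: raise on any non-200 response instead of silently falling through, so that a task runner can retry. The names `MegaparseRequestError`, `FakeResponse`, and `extract_result` are hypothetical and are not taken from the PR; only the `status_code` check and the `"result"` key come from the diff above.

```python
from dataclasses import dataclass


class MegaparseRequestError(Exception):
    """Hypothetical error type, raised so the calling task can be retried."""


@dataclass
class FakeResponse:
    """Stand-in for an HTTP response object (assumption, for illustration only)."""

    status_code: int
    payload: dict

    def json(self) -> dict:
        return self.payload


def extract_result(response) -> str:
    # Fail loudly on any non-200 status instead of returning nothing.
    if response.status_code != 200:
        raise MegaparseRequestError(
            f"MegaParse request failed with status {response.status_code}"
        )
    return response.json().get("result")
```

Whether a retry actually happens depends on how the task queue treats uncaught exceptions, which this thread does not spell out.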

core/quivr_core/processor/implementations/megaparse_processor.py Outdated

+                      if response.status_code == 200:
+                          result = response.json().get("result")
+                          document = Document(page_content=result)
+                          if len(document.page_content) > self.splitter_config.chunk_size:

Collaborator

AmineDiro Oct 24, 2024

if :

Suggested change

      
                        if len(document.page_content) > self.splitter_config.chunk_size:
          
                        if len(document.page_content) , self.splitter_config.chunk_size:

I think the splitter still works when the content fits in a single chunk. This is preferable because we would have the chunk_size in the metadata.

Collaborator Author

chloedia Oct 29, 2024

Now I am just checking that the document is not empty; tell me if that is what you meant.
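
The idea in this exchange can be sketched as follows: reject empty content, then always run the splitter, even when the text fits in one chunk, so every returned document carries the same metadata. `Doc` and `split_with_metadata` are hypothetical stand-ins, the fixed-width slicing is a simplification rather than the splitter used in the PR, and storing the configured `chunk_size` in the metadata is an assumption about what "chunk_size in metadata" means here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for a document with content and metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def split_with_metadata(text: str, chunk_size: int) -> List[Doc]:
    # Only guard against empty content; short documents still go through
    # the splitter and come back as a single chunk.
    if not text:
        raise ValueError("parsed document is empty")
    return [
        Doc(
            page_content=text[start:start + chunk_size],
            metadata={"chunk_size": chunk_size, "chunk_index": start // chunk_size},
        )
        for start in range(0, len(text), chunk_size)
    ]
```

With this shape there is no separate code path for short documents, so callers never need to special-case a bare, metadata-free document.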

chloedia added 2 commits

October 29, 2024 12:08


          update: Megaparse API integration

bbaa37e


          update: Megaparse API integration

d5e3424

dosubot bot added size:L and removed size:M labels

chloedia added 4 commits

October 29, 2024 14:49


          fix: examples position & fix: MegaparseProcessor Support

617167e


          fix: megaparse fail extension case

2b865af


          fix: examples position & fix: MegaparseProcessor Support

5c65d05


          fix: update processor tests

fe36a6a

dosubot bot added size:XS and removed size:L labels

Collaborator Author

chloedia commented Oct 29, 2024

Fixes: #3390

Note: moved the examples from ./examples to ./core/examples @AmineDiro

The examples work again :)

chloedia requested review from AmineDiro and StanGirard

October 29, 2024 15:01

AmineDiro requested changes

core/quivr_core/processor/implementations/megaparse_processor.py

+                              )
+                      if response.status_code == 200:
+                          result = response.json().get("result")

Collaborator

AmineDiro Oct 30, 2024

This could be None. We should deserialize into structs.

Collaborator Author

chloedia Oct 31, 2024

Not sure what you mean
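
One possible reading of "deserialize into structs", sketched under assumptions: instead of passing the raw value of `.get("result")` along (which is `None` when the key is missing), validate the payload into a typed object up front. `ParseResponse` and `from_payload` are hypothetical names, and a standard-library dataclass is used here only to keep the sketch self-contained; the reviewer may have had another library or type in mind.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ParseResponse:
    """Hypothetical typed view of the JSON body returned by the parsing API."""

    result: str

    @classmethod
    def from_payload(cls, payload: Mapping[str, Any]) -> "ParseResponse":
        result = payload.get("result")
        # Reject a missing key, null, or non-string value here, rather than
        # letting None reach the code that builds the document.
        if not isinstance(result, str):
            raise ValueError("response payload has no string 'result' field")
        return cls(result=result)
```

Downstream code can then rely on `parsed.result` being a string, and a malformed response fails at the boundary with a clear message.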

core/tests/processor/docx/test_docx.py


		megaparse = pytest.importorskip("megaparse")

Collaborator

AmineDiro Oct 30, 2024

This effectively skips megaparse. Do we need to run tests here?

Collaborator Author

chloedia Oct 31, 2024

It takes too long to test it all. Note that the file types are already tested in megaparse.


          Fix comments

f40ad35

chloedia changed the title ~~fix: add megaparse api~~ fix: add megaparse sdk

Collaborator Author

chloedia commented Nov 7, 2024

switch to #3462

chloedia closed this


Labels

size:XS