Feature: Deep Search PDF to MD file conversion #33
+494
−101
WIP/POC for PDF -> MD file conversion using Deep Search
Issue #2
While the endpoint is being established, this PR takes the client-side TypeScript library and stands up a Go gin server that mocks the file conversion API responses.
All of this code infers the API from the client TypeScript library in `src/lib/api/deepsearch/index.ts`. It will certainly require some tuning once the endpoint is stood up. I had to make one adjustment to the `getDocumentHashes` DS4SD library code: removing the `/` before `api/xxx` in `path:` `api/cps/public/v2/project/${projKey}/data_indices/${indexKey}/documents/transactions/${transactionId}`, to prevent a double slash and a 404 from being returned. Possibly a typo, TBD.

The UI components call the mock Deep Search API for PDF to Markdown conversion using the following operations from NextJS SSRs:
`v2/project/mockProject/celery_tasks/mock-task-id`.

To start the Go mock Deep Search API server, do the following:
Here are the curl commands for functional testing and for validating the API calls. I am also listing them to clarify the operations in the `src/app/api/conversion/route.ts` code in this PR.

```
curl -X GET http://localhost:8080/api/cps/public/v2/project/mockProject/data_indices/mockIndex/documents/transactions/mock-transaction-id \
  -H "Authorization: mock-token"
```

```
curl -X GET http://localhost:8080/api/cps/public/v2/project/mockProject/data_indices/mockIndex/documents/mock-document-hash/artifacts \
  -H "Authorization: mock-token"
```
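For reference, the same two GET requests could be issued from TypeScript roughly as follows. This is only a sketch, not the code in `src/app/api/conversion/route.ts` or in the DS4SD library: the helper names (`joinUrl`, `transactionUrl`, `artifactsUrl`, `getJson`) are hypothetical, the host, project, index and token values are the mock ones from the curl commands, and it assumes the mock server answers with JSON. `joinUrl` shows one way to avoid the double slash mentioned earlier, by normalizing the slashes where the base URL and the path meet.

```typescript
// Mock values taken from the curl commands; adjust for a real deployment.
const BASE_URL = "http://localhost:8080";
const PROJ_KEY = "mockProject";
const INDEX_KEY = "mockIndex";

// Hypothetical helper: join a base URL and a path with exactly one slash
// between them, whether or not either side already carries one.
function joinUrl(base: string, path: string): string {
  return `${base.replace(/\/+$/, "")}/${path.replace(/^\/+/, "")}`;
}

// URL for looking up the documents of a transaction.
function transactionUrl(transactionId: string): string {
  return joinUrl(
    BASE_URL,
    `api/cps/public/v2/project/${PROJ_KEY}/data_indices/${INDEX_KEY}/documents/transactions/${transactionId}`,
  );
}

// URL for listing the artifacts of a converted document.
function artifactsUrl(documentHash: string): string {
  return joinUrl(
    BASE_URL,
    `api/cps/public/v2/project/${PROJ_KEY}/data_indices/${INDEX_KEY}/documents/${documentHash}/artifacts`,
  );
}

// GET a URL with the Authorization header and parse the body as JSON.
// Assumption: the mock replies with JSON; the shape is not known here,
// so the result is typed as unknown.
async function getJson(url: string, token: string): Promise<unknown> {
  const res = await fetch(url, { headers: { Authorization: token } });
  if (!res.ok) {
    throw new Error(`GET ${url} failed with status ${res.status}`);
  }
  return res.json();
}
```

With the mock server running, `await getJson(transactionUrl("mock-transaction-id"), "mock-token")` would correspond to the first curl command, and `await getJson(artifactsUrl("mock-document-hash"), "mock-token")` to the second.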
Here is a screencap of a basic upload. Once we get a feel for the conversion times, we can consider integrating this directly into the knowledge submission form, which would render the PDF doc to MD directly on upload and then kick off the process to post it in a repo and supply the location and SHA of the docs.
conversion-mockup.mp4