Add media description feature using Azure Content Understanding #2195

pamelafox · 2024-11-25T23:49:49Z

Purpose

This PR adds a new optional feature that extracts figures from documents (using the figures output mode of Azure Document Intelligence) and sends those figures to Azure Content Understanding (a new service that uses multimodal models) to generate a description of each figure. The description is then inserted into the document content, which is then sent for chunking. If the figure is a graph or chart, the description includes an HTML table with the data.
This gives developers a more lightweight approach to ingesting media-rich documents, and it can be compared to the more heavyweight GPT-4-vision approach.
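
As a rough illustration of the flow described above (not the code in this PR), here is a minimal sketch of the "insert the description into the content" step. Everything named here is an assumption made for the example: the `[[figure:ID]]` placeholder format, the `Figure` dataclass, the `insert_figure_descriptions` helper, and the `describe` callable that stands in for the call to Azure Content Understanding. The real parser output and service client may look quite different.

```python
import html
import re
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical placeholder that marks where a figure sat in the extracted text.
FIGURE_PLACEHOLDER = re.compile(r"\[\[figure:(?P<id>[\w.-]+)\]\]")


@dataclass
class Figure:
    """Hypothetical container for one extracted figure."""

    figure_id: str
    caption: str
    image_bytes: bytes


def insert_figure_descriptions(
    text: str,
    figures: Iterable[Figure],
    describe: Callable[[bytes], str],
) -> str:
    """Replace each figure placeholder with a <figure> block holding its description.

    `describe` stands in for the multimodal service call. Its result is inserted
    as-is, because it may already contain an HTML table for charts and graphs.
    """
    by_id = {figure.figure_id: figure for figure in figures}

    def replace(match: re.Match) -> str:
        figure = by_id.get(match.group("id"))
        if figure is None:
            # Unknown figure: leave the placeholder untouched.
            return match.group(0)
        description = describe(figure.image_bytes)
        caption = html.escape(figure.caption)
        return f"<figure><figcaption>{caption}</figcaption>{description}</figure>"

    return FIGURE_PLACEHOLDER.sub(replace, text)


def fake_describe(image_bytes: bytes) -> str:
    """Stub describer so the sketch runs without any Azure resources."""
    return "Bar chart of sales.<table><tr><th>Q1</th><td>10</td></tr></table>"
```

The enriched text returned by `insert_figure_descriptions` would then be handed to the chunking step in place of the original page text. Passing the describer in as a callable keeps the sketch testable offline.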

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[ ] Yes
[X] No

Does this require changes to learn.microsoft.com docs?

This repository is referenced by this tutorial,
which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.

[ ] Yes
[X] No

Type of change

[ ] Bugfix
[X] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

Code quality checklist

See CONTRIBUTING.md for more details.

The current tests all pass (python -m pytest).
I added tests that prove my fix is effective or that my feature works
I ran python -m pytest --cov to verify 100% coverage of added lines
I ran python -m mypy to check for type errors
I either used the pre-commit hooks or ran ruff and black manually on my code.

github-actions · 2024-11-25T23:50:08Z

Check Broken Paths

We have automatically detected the following broken relative paths in your files.
Review and fix the paths to resolve this issue.

Check the file paths and associated broken paths inside them. For more details, check our Contributing Guide.

File Full Path Issues

CONTRIBUTING.md

#	Link	Line Number
1	`./main.parameters.json`	`169`
2	`./main.bicep`	`170`

github-actions · 2024-11-25T23:50:24Z

Check Country Locale in URLs

We have automatically detected added country locale to URLs in your files.
Review and remove country-specific locale from URLs to resolve this issue.

Check the file paths and associated URLs inside them. For more details, check our Contributing Guide.

File Full Path Issues

README.md

#	Link	Line Number
1	`https://azure.microsoft.com/en-us/pricing/details/cognitive-services/computer-vision/`	`96`

pamelafox · 2024-11-25T23:51:09Z

I need to do some cleanup, per the CI. This also needs a few more tests, as it currently only has a test for the changes to the splitting algorithm.

CONTRIBUTING.md

mattgotteiner · 2024-12-02T19:10:01Z

app/backend/prepdocslib/figure_output.json

@@ -0,0 +1,127 @@
+"figures": [


Should we pull this into the data directory, or not check it in at all?

app/backend/prepdocslib/pdfparser.py

github-actions · 2024-12-04T21:03:45Z

Check Broken Paths

We have automatically detected the following broken relative paths in your files.
Review and fix the paths to resolve this issue.

Check the file paths and associated broken paths inside them. For more details, check our Contributing Guide.

File Full Path Issues

CONTRIBUTING.md

#	Link	Line Number
1	`./main.parameters.json`	`169`
2	`./main.bicep`	`170`

github-actions · 2024-12-04T21:04:01Z

Check Country Locale in URLs

We have automatically detected added country locale to URLs in your files.
Review and remove country-specific locale from URLs to resolve this issue.

Check the file paths and associated URLs inside them. For more details, check our Contributing Guide.

File Full Path Issues

README.md

#	Link	Line Number
1	`https://azure.microsoft.com/en-us/pricing/details/cognitive-services/computer-vision/`	`96`

pamelafox · 2024-12-07T00:19:46Z

I have now completed both unit testing and manual testing, and this is good to merge. We will merge/release on Monday.

pamelafox added 6 commits November 19, 2024 16:42

First pass

c19a9f3

CU kinda working

7b52dac

CU integration

65e5616

Better splitting

7130a24

Add Bicep

9ba6e3a

Rm unneeded figures

c621a43

mattgotteiner reviewed Dec 2, 2024

CONTRIBUTING.md (outdated, resolved)

mattgotteiner reviewed Dec 2, 2024

app/backend/prepdocslib/pdfparser.py (outdated, resolved)

mattgotteiner approved these changes Dec 2, 2024

Remove en-us from URLs

0fef108

pamelafox and others added 14 commits December 4, 2024 13:16

Fix URLs

93e774d

Remote figures output JSON

3b104fb

Update matrix comments

9973a77

Merge branch 'main' into contentunderstanding

109d7d4

Make mypy happy

0681755

Merge branch 'contentunderstanding' of https://github.com/pamelafox/azure-search-openai-demo into contentunderstanding

400d313

Add same errors to file strategy

ec66c52

Add pymupdf to skip modules for mypy

5a3040a

Output the endpoint from Bicep

2a6e604

100 percent coverage for mediadescriber.py

b8c4d94

Tests added for PDFParser

8ec9514

Fix that tuple type

6d4e756

Add pricing link

75b159d

Fix content read issue

c88b5d5

pamelafox merged commit 0bb3f95 into Azure-Samples:main Dec 9, 2024
16 checks passed

pamelafox deleted the contentunderstanding branch December 9, 2024 18:37
