
Storage container names don't match storage_name parameter - is this causing indexing to fail? #47

Closed
brian-mayer opened this issue Jul 3, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@brian-mayer

Describe the bug
I've deployed the infrastructure and it all seems to have deployed successfully. I am able to walk through the Jupyter Quickstart notebook and use the API to upload the recommended sample UTF-8 text documents. Indexing seems to start per the API message but stops at 6.25% or 12.5%. No indexes ever show up on the Azure AI Search instance.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy accelerator solution
  2. Use Jupyter notebook Quickstart to walk through API calls
  • Upload sample UTF-8 files successfully into the blob container; however, the containers are named with random identifier strings (example: 345yu37291db2aa8ced66f43edw5f6n7) rather than the specified storage_name parameter

  • Try to start an indexing job using notebook API call

  • Indexing job initiates but fails - either at 6.25% or 12.5%

When the API is queried for status, the response looks like this:

```json
{
  "status_code": 200,
  "index_name": "wiki-articles-index",
  "storage_name": "wiki-articles-storage",
  "status": "failed",
  "percent_complete": 12.5,
  "progress": "2 out of 16 workflows completed successfully."
}
```

Expected behavior
I expect the index to be built so that I can query it.

Desktop (please complete the following information):

  • OS: macOS
  • Version: 14.4.1

Additional context
I've tried restarting the graphrag AKS containers and stripping the files being processed down to just one file. Nothing has changed the outcome: no apparent indexing ever happens. Is this related to the container names not matching the storage_name parameter input in the Jupyter Quickstart cell?

brian-mayer added the bug label on Jul 3, 2024
@jgbradley1
Collaborator

jgbradley1 commented Jul 5, 2024

Hello @brian-mayer! The storage_name will not match the actual name of the blob container. For a better security posture, we first sanitize the name provided by an API end user by computing a hash, and we use that hash as the actual blob container name. To be exact, the hash calculation from a user-provided storage_name string is done in this function.
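For illustration, here is a minimal sketch of that idea (the hash algorithm shown is an assumption for the example; the linked function is the source of truth):

```python
# Minimal sketch: the user-supplied storage_name is hashed, and the hex
# digest becomes the real blob container name. SHA-256 is an assumption
# here; check the linked function for the actual algorithm.
import hashlib

def sanitize_name(storage_name: str) -> str:
    return hashlib.sha256(storage_name.encode("utf-8")).hexdigest()

print(sanitize_name("wiki-articles-storage"))  # a fixed hex string, not the raw name
```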

@jgbradley1
Collaborator

To assist with debugging, there is one place you can look for additional logging. In the Azure Storage instance that gets deployed within the resource group at deployment time, there will be a blob container named reports. It holds a continuously running log of the FastAPI application, so if there are errors you might see them logged there. Also, within the blob container associated with the hash of the index_name you tried to build, there is a reports directory containing a log file for the indexing job. That file will contain all output from running the indexing job. If you ran the same indexing job multiple times, there will be a separate log file per attempt.
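If you want to pull those log files down programmatically rather than browsing the portal, something like this works (a sketch assuming the azure-storage-blob package, a connection string in your environment, and the hash-based container naming described above):

```python
# Sketch of fetching the two kinds of logs described above. The hash used
# for the container name is an assumption; see the sanitize function.
import hashlib
import os

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)

# 1. The continuously running FastAPI application log:
for blob in service.get_container_client("reports").list_blobs():
    print(blob.name)

# 2. Per-job indexing logs, stored under reports/ inside the container
#    whose name is the hash of the index_name:
container_name = hashlib.sha256(b"wiki-articles-index").hexdigest()
index_container = service.get_container_client(container_name)
for blob in index_container.list_blobs(name_starts_with="reports/"):
    log_text = index_container.download_blob(blob.name).readall().decode("utf-8")
    print(f"--- {blob.name} ---\n{log_text[:500]}")
```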

We are looking into hooking these logs up to App Insights so you don't have to hunt for these log files manually. The code to support App Insights is in the codebase, but it has not been fully re-tested after some recent changes we made, so we have not turned this form of logging back on by default.

We will look into it soon and try to get better logging enabled by default again.

@jgbradley1
Collaborator

I recently pushed a PR that hooks graphrag up to App Insights. If you're interested, check out the latest on the main branch.

With that PR, log messages are captured in App Insights, along with any errors that occur in calls to the API. Please report back if you encounter any further issues. Please note that step 2 of the pipeline (entity extraction) is responsible for a large portion of the overall indexing time (roughly 90%). Once that step is complete, the remaining steps finish fairly quickly.
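If you'd rather query those App Insights logs from code than from the portal, a sketch along these lines should work (the workspace ID and the AppTraces table name are assumptions for a workspace-based App Insights resource):

```python
# Sketch of querying recent trace logs with the azure-monitor-query SDK.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # assumption: your workspace
    query="AppTraces | sort by TimeGenerated desc | take 50",
    timespan=timedelta(days=1),
)
for table in response.tables:
    for row in table.rows:
        print(row)
```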
