[BUG] - "Indexing failed at 12.5 %" #139

Open
doruit opened this issue Aug 13, 2024 · 17 comments
Labels
bug Something isn't working

Comments

doruit commented Aug 13, 2024

Describe the bug
The indexing job gets stuck. After this message:

<Response [200]>
{"status":"Indexing operation scheduled"}

I'm checking the status every now and then; after a while I get this:

{
'status_code': 200,
'index_name': 'index-2',
'storage_name': 'testdata1',
'status': 'failed',
'percent_complete': 12.5,
'progress': '2 out of 16 workflows completed successfully.',
}
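
For reference, this is roughly how I'm polling the status from the notebook (a minimal sketch; the /index/status/<index_name> path and the Ocp-Apim-Subscription-Key header follow the quickstart notebook, so treat the exact endpoint and header names as assumptions):

import time
import requests

# Assumed values -- adjust to your own APIM endpoint and key.
endpoint = "https://<apim-name>.azure-api.net"
headers = {"Ocp-Apim-Subscription-Key": "<subscription-key>"}
index_name = "index-2"

# Poll the status endpoint until the job either completes or fails.
while True:
    response = requests.get(f"{endpoint}/index/status/{index_name}", headers=headers)
    status = response.json()
    print(status.get("percent_complete"), status.get("progress"))
    if status.get("status") in ("complete", "failed"):
        break
    time.sleep(30)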

To Reproduce
Steps to reproduce the behavior:

  1. Follow the deployment guide
  2. Download a small set of Wikipedia articles
  3. Install all dependencies for the Quickstart notebook "1-Quickstart.ipynb"
  4. Run the notebook
  5. Validate that all steps up to the indexing job run successfully
  6. At the step "Build an Index" the response is "{"status":"Indexing operation scheduled"}", but the index does not seem to be created
  7. At the step "Check status of an indexing job" it gets stuck at 'percent_complete': 12.5
  8. Check the AI Search service to see whether the index is created at some point; it is never created

Expected behavior
I expect the indexing job to finish successfully.

Screenshots
n/a

Desktop (please complete the following information):

  • OS: MacOS

Additional context
n/a

doruit added the bug label on Aug 13, 2024
doruit changed the title from [BUG] to [BUG] - "Indexing failed at 12.5 %" on Aug 14, 2024
@timothymeyers (Contributor) commented:

Any luck @doruit? Did you happen to try running again?

When you kick off an indexing run, a kubernetes job is spun up (within about 5 minutes). If you ran deploy.sh, you should be able to

watch kubectl get jobs -n graphrag

and wait for the indexing job to appear. Then

kubectl logs job/<indexing job name> -n graphrag -f

to watch the logs to monitor progress. You'll possibly see some 503 and 429 errors, which is normal as the indexer runs out of tokens and has to wait for the rate limiter to let it back in. (There's ongoing work to clean this up)

But if for some reason your indexer dies, you'll be able to see what happened when it did.
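
If the job has already finished (or died), the job name may no longer resolve; in that case you can list the pods directly and pull the logs from the indexing pod instead. This is plain kubectl, nothing accelerator-specific:

# list pods in the namespace and find the one created by the indexing job
kubectl get pods -n graphrag

# tail its logs; use --previous if the container has already restarted
kubectl logs <indexing pod name> -n graphrag --tail=200
kubectl logs <indexing pod name> -n graphrag --previous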

doruit commented Aug 19, 2024

@timothymeyers, I just did a fresh deployment to rule out some possible causes...

I've checked the storage account; it seems the files are uploaded to a container with a random name, whereas I expected the name I declared in the notebook:

file_directory = "testdata"
storage_name = "testdata"
index_name = "index1"

However, the files are uploaded to a container with a number as its name instead:

[screenshot of the storage account containers]

Is this expected?

@rnpramasamyai commented:

@doruit Please check the logs of your indexing pod and you will get an idea of what is going on.

@timothymeyers (Contributor) commented:

However, the files are uploaded to a container with a number as its name instead. Is this expected?

Hi @doruit - yes this is the expected behavior. The names that you give are hashed to improve the overall security posture.
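
(Purely as an illustration of the idea, not necessarily the accelerator's exact scheme: the human-readable name you pass in is mapped to an opaque, deterministic identifier, e.g. something along the lines of a SHA-256 digest.)

import hashlib

storage_name = "testdata"
# Hypothetical example only -- the accelerator's actual hashing may differ.
container_name = hashlib.sha256(storage_name.encode()).hexdigest()
print(container_name)  # an opaque hex string instead of 'testdata'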

Did you run into the same issues during indexing with your new deployment? Did you happen to try inspecting the index pod logs like I mentioned?

doruit commented Aug 20, 2024

Hi @timothymeyers, earlier I saw in the indexing pod logs that the token limit is reached many times. That seems strange to me, as I'm using the following TPM settings:

[screenshot of the TPM quota settings]

That should be sufficient, right? I have also turned off dynamic quota allocation.

When looking at the job monitor, it says no jobs are running:

[screenshot: no jobs running]

When checking the job status from the notebook at the same time, it says:

[screenshot of the job status from the notebook]

@rnpramasamyai commented:

@doruit, could you please add the api_key property under each LLM node in the following file: pipeline-settings.yaml?
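
For context, the idea is that each LLM node in the pipeline settings carries its own credentials. Roughly, a hypothetical sketch of what that could look like (the exact keys and layout in your pipeline-settings.yaml may differ):

llm:
  type: azure_openai_chat
  api_base: https://<aoai-name>.openai.azure.com
  api_version: 2023-03-15-preview
  api_key: <azure-openai-key>        # property being suggested here

embeddings:
  llm:
    type: azure_openai_embedding
    api_key: <azure-openai-key>      # and under the embeddings LLM node as well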

doruit commented Aug 21, 2024

@rnpramasamyai, I've added the api_key property:

[screenshot of pipeline-settings.yaml with the api_key property added]

After this I reran the Quickstart notebook to build a new index:

[screenshot of the notebook output]

But now the indexing manager does not seem to instantiate an indexing job at all.

Should I remove the graphrag namespace and run the deployment again?

@rnpramasamyai commented:

@doruit Please run the deployment script again.

doruit commented Aug 21, 2024

Deployment was successful, however indexing is still not working. Should the API version match the value from the deployment documentation, or the API version mentioned in the Playground > View Code window?

[screenshots of the deployment documentation and the Playground > View Code window]

@rnpramasamyai commented:

@doruit Please always check the pod's logs if indexing is not working, and post those logs.

doruit commented Aug 22, 2024

I did a full deployment again, checked all parameters, and ran the notebook again from the start. After running the step "Build an Index" I get this message:

{
    'status_code': 200,
    'index_name': 'index7',
    'storage_name': 'testdata',
    'status': 'scheduled',
    'percent_complete': 0.0,
    'progress': '',
}

At the same time I'm watching the jobs, waiting for the indexing job to come by, but I only see the graphrag index manager run every 5 minutes:

Every 2.0s: kubectl get jobs -...  SandboxHost-638599057829007509: Thu Aug 22 11:25:28 2024

NAME                              COMPLETIONS   DURATION   AGE
graphrag-index-manager-28738765   1/1           25s        28s

This is my parameters file:

{
  "GRAPHRAG_API_BASE": "https://aoai-graphrag-tst-francecentral.openai.azure.com",
  "GRAPHRAG_API_VERSION": "2024-02-15-preview",
  "GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME": "text-embedding-ada-002",
  "GRAPHRAG_EMBEDDING_MODEL": "text-embedding-ada-002",
  "GRAPHRAG_LLM_DEPLOYMENT_NAME": "gpt-4o",
  "GRAPHRAG_LLM_MODEL": "gpt-4o",
  "LOCATION": "francecentral",
  "RESOURCE_GROUP": "rg-graphrag-tst-04"
}

Not sure where to look now, as the indexing job does not start at all anymore. What region, LLM model version, API version, etc. should I use as a reference?
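
One thing I can still check (plain kubectl, using the index-manager job name from the watch output above) is what the index manager itself logs on its 5-minute runs, since that is where the decision to schedule or skip an indexing job is made:

# list the recent index-manager job runs
kubectl get jobs -n graphrag

# dump the logs of the most recent run
kubectl logs job/graphrag-index-manager-28738765 -n graphrag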

@rnpramasamyai commented:

@doruit Indexing will take time to complete.

doruit commented Aug 22, 2024

@rnpramasamyai, I've waited for an hour, but it seems it will not start, nor can I find any clue about where to look for errors. In the job log I only see this message every 5 minutes:

[screenshot of the index-manager log message]

What else can I check or rule out?

doruit commented Aug 23, 2024

@rnpramasamyai @timothymeyers
I switched from a CSP tenant to my MSDN tenant/subscription, did the full deployment, and it now seems to work:

HTTP/1.1 200 OK
content-length: 172
content-type: application/json
date: Fri, 23 Aug 2024 11:53:38 GMT
request-context: appId=cid-v1:xxxxxxxxxxxxxx
vary: Origin
    
{
    "status_code": 200,
    "index_name": "index1",
    "storage_name": "testdata",
    "status": "complete",
    "percent_complete": 100.0,
    "progress": "16 out of 16 workflows completed successfully."
}

I checked whether quota or Azure policy caused the issue in the CSP tenant/subscription; however, I could not find any logs, so I was not able to rule everything out.

There is only one policy that might impact the creation of VMs/VMSS. That policy requires VMs to have managed disks, which they all have, so I guess it won't block anything. Another policy blocks the creation of classic resources.

However, the good news is that with the alternative tenant/subscription the deployment was successful.

eai-douglaswross commented Aug 28, 2024

Firstly: thank you for this repo, and thanks for trying to help us punters understand what you have written.

I have the same issue with stopping at 2/16 workflows (12.5%).
I do not have an MSDN tenant; however, we do not have any policies specifically added to our tenant. It is very new, and out of the box.

The pod log command does not seem to work; I tried with both names while the job was running:

graphrag-solution-accelerator-py3.10vscode@docker-desktop:/graphrag-accelerator$ kubectl logs job/graphrag-index-manager-28746945 -n graphrag -f
Indexing job for 'indtestdata' already running. Will not schedule another. Exiting...
graphrag-solution-accelerator-py3.10vscode@docker-desktop:/graphrag-accelerator$ kubectl logs job/indtestdata -n graphrag -f
error: error from server (NotFound): jobs.batch "indtestdata" not found in namespace "graphrag"
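
(It looks like the actual indexing job gets its own generated name, different from both the index-manager job and the index name, so the way to find it -- plain kubectl, names below are placeholders -- is to list everything first:)

# list all jobs and pods in the namespace to find the real indexing job name
kubectl get jobs,pods -n graphrag

# then tail the logs of that job's pod
kubectl logs <indexing pod name> -n graphrag -f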

Can I suggest/request the following, as it may make everyone's job a little easier:

  1. Enable Azure AI Search access from the Portal when in a DEV deployment mode: can you set some variable in the deployment, like deployment_type=<dev/prod>, so that the blocking of Azure Portal access to the AI Search index is turned off in dev mode and locked down for a prod deployment?
  2. Add the option for a VM into deployment of the infra into the private network, so that we can use this method: https://learn.microsoft.com/en-gb/azure/search/service-create-private-endpoint#use-the-azure-portal-to-access-a-private-search-service
  3. Provide some instructions about manually putting a VM in the private network via the Azure portal, so we can remote into it and access the Azure Portal functionality as suggested in that link

In other words: it is very difficult to see what is going on and to understand what is going wrong.

Lastly, when you add a comment like:

@doruit, could you please add the api_key property under each LLM node in the following file: pipeline-settings.yaml?

For the rest of us trying to follow along, would you mind telling us quickly why you are suggesting that, so that we can also understand why it might fix the issue.

doruit commented Aug 28, 2024

I still don't know what caused the process to get stuck. It was not due to Azure policy or the api_key in pipeline-settings.yaml. Perhaps the model and API version are causing the issue. Other issue threads mention that if the embedding vector size is slightly different from what is expected, indexing will fail.

@timothymeyers, in deployment.md it looks like the API version is fixed to "2023-03-15-preview". Is that correct, or should the documentation instruct the developer to get the right API version for the deployed model (e.g. via the portal)?
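
One cheap sanity check for the vector-size theory (a sketch with the openai Python SDK; the key and deployment name are placeholders, the endpoint and API version are the ones from my parameters file) is to call the embedding deployment directly and confirm the vector length matches what the index expects, e.g. 1536 for text-embedding-ada-002:

from openai import AzureOpenAI

# Placeholders -- substitute your own key; endpoint/API version as configured above.
client = AzureOpenAI(
    azure_endpoint="https://aoai-graphrag-tst-francecentral.openai.azure.com",
    api_key="<azure-openai-key>",
    api_version="2024-02-15-preview",
)

resp = client.embeddings.create(model="text-embedding-ada-002", input="hello world")
print(len(resp.data[0].embedding))  # expect 1536 for text-embedding-ada-002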

@Daksh-S97 commented:

Hi, I'm facing a similar issue. I'm running it on a dev container. Which API version did you end up using?
