[BUG] - "Indexing failed at 12.5 %" #139

Open
doruit opened this issue Aug 13, 2024 · 17 comments
Labels
bug Something isn't working

Comments

doruit commented Aug 13, 2024

Describe the bug
The indexing job gets stuck. After this message:

<Response [200]>
{"status":"Indexing operation scheduled"}

I'm checking the status every now and then; after a while I get this:

{
'status_code': 200,
'index_name': 'index-2',
'storage_name': 'testdata1',
'status': 'failed',
'percent_complete': 12.5,
'progress': '2 out of 16 workflows completed successfully.',
}
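
For reference, this is roughly how I'm polling the status from the notebook (a minimal sketch; the /index/status/<index_name> path and the Ocp-Apim-Subscription-Key header follow the quickstart notebook, so treat the exact endpoint and header names as assumptions):

import time
import requests

# Assumed values -- adjust to your own APIM endpoint and key.
endpoint = "https://<apim-name>.azure-api.net"
headers = {"Ocp-Apim-Subscription-Key": "<subscription-key>"}
index_name = "index-2"

# Poll the status endpoint until the job either completes or fails.
while True:
    response = requests.get(f"{endpoint}/index/status/{index_name}", headers=headers)
    status = response.json()
    print(status.get("percent_complete"), status.get("progress"))
    if status.get("status") in ("complete", "failed"):
        break
    time.sleep(30)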

To Reproduce
Steps to reproduce the behavior:

  1. Follow the deployment guide
  2. Download a small set of Wikipedia articles
  3. Install all dependencies for the Quickstart notebook "1-Quickstart.ipynb"
  4. Run the notebook
  5. Validate that all steps up to the indexing job run successfully
  6. At the step "Build an Index" the response is "{"status":"Indexing operation scheduled"}", but the index does not seem to be created
  7. At the step "Check status of an indexing job" it gets stuck at 'percent_complete': 12.5
  8. Check the AI Search service to see whether the index is created at some point; it is never created

Expected behavior
I expect the indexing job to finish successfully.

Screenshots
n/a

Desktop (please complete the following information):

  • OS: MacOS

Additional context
n/a

doruit added the bug label on Aug 13, 2024
doruit changed the title from [BUG] to [BUG] - "Indexing failed at 12.5 %" on Aug 14, 2024
@timothymeyers (Contributor) commented:

Any luck @doruit? Did you happen to try running again?

When you kick off an indexing run, a kubernetes job is spun up (within about 5 minutes). If you ran deploy.sh, you should be able to

watch kubectl get jobs -n graphrag

and wait for the indexing job to appear. Then

kubectl logs job/<indexing job name> -n graphrag -f

to watch the logs to monitor progress. You'll possibly see some 503 and 429 errors, which is normal as the indexer runs out of tokens and has to wait for the rate limiter to let it back in. (There's ongoing work to clean this up)

But if for some reason your indexer dies, you'll be able to see what happened when it did.
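
If the job has already finished (or died), the job name may no longer resolve; in that case you can list the pods directly and pull the logs from the indexing pod instead. This is plain kubectl, nothing accelerator-specific:

# list pods in the namespace and find the one created by the indexing job
kubectl get pods -n graphrag

# tail its logs; use --previous if the container has already restarted
kubectl logs <indexing pod name> -n graphrag --tail=200
kubectl logs <indexing pod name> -n graphrag --previous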

doruit commented Aug 19, 2024

@timothymeyers, I just did a fresh deployment to rule out some possible causes...

I've checked the storage account; it seems the files are uploaded to a container with a random name, whereas I expected the name I declared in the notebook:

file_directory = "testdata"
storage_name = "testdata"
index_name = "index1"

However, the files are uploaded to a container with a number as its name instead:

[screenshot of the storage account containers]

Is this expected?

@rnpramasamyai commented:

@doruit Please check the logs of your indexing pod and you will get an idea of what is going on.

@timothymeyers (Contributor) commented:

However, the files are uploaded to a container with a number as its name instead. Is this expected?

Hi @doruit - yes this is the expected behavior. The names that you give are hashed to improve the overall security posture.
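
(Purely as an illustration of the idea, not necessarily the accelerator's exact scheme: the human-readable name you pass in is mapped to an opaque, deterministic identifier, e.g. something along the lines of a SHA-256 digest.)

import hashlib

storage_name = "testdata"
# Hypothetical example only -- the accelerator's actual hashing may differ.
container_name = hashlib.sha256(storage_name.encode()).hexdigest()
print(container_name)  # an opaque hex string instead of 'testdata'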

Did you run into the same issues during indexing with your new deployment? Did you happen to try inspecting the index pod logs like I mentioned?

doruit commented Aug 20, 2024

Hi @timothymeyers, earlier I saw in the indexing pod logs that the token limit is reached many times. That seems strange to me, as I'm using the following TPM settings:

[screenshot of the TPM quota settings]

That should be sufficient, right? I have also turned off dynamic quota allocation.

When looking at the job monitor, it says no jobs are running:

[screenshot: no jobs running]

When checking the job status from the notebook at the same time, it says:

[screenshot of the job status from the notebook]

@rnpramasamyai commented:

@doruit, could you please add the api_key property under each LLM node in the following file: pipeline-settings.yaml?
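
For context, the idea is that each LLM node in the pipeline settings carries its own credentials. Roughly, a hypothetical sketch of what that could look like (the exact keys and layout in your pipeline-settings.yaml may differ):

llm:
  type: azure_openai_chat
  api_base: https://<aoai-name>.openai.azure.com
  api_version: 2023-03-15-preview
  api_key: <azure-openai-key>        # property being suggested here

embeddings:
  llm:
    type: azure_openai_embedding
    api_key: <azure-openai-key>      # and under the embeddings LLM node as well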

doruit commented Aug 21, 2024

@rnpramasamyai, I've added the api_key property:

[screenshot of pipeline-settings.yaml with the api_key property added]

After this I reran the Quickstart notebook to build a new index:

[screenshot of the notebook output]

But now the indexing manager does not seem to instantiate an indexing job at all.

Should I remove the graphrag namespace and run the deployment again?

@rnpramasamyai commented:

@doruit Please run the deployment script again.

doruit commented Aug 21, 2024

Deployment was successful, however indexing is still not working. Should the API version match the value from the deployment documentation, or the API version mentioned in the Playground > View Code window?

[screenshots of the deployment documentation and the Playground > View Code window]

@rnpramasamyai commented:

@doruit Please always check the pod's logs if indexing is not working, and post those logs.

doruit commented Aug 22, 2024

I did a full deployment again, checked all parameters, and ran the notebook again from the start. After running the step "Build an Index" I get this message:

{
    'status_code': 200,
    'index_name': 'index7',
    'storage_name': 'testdata',
    'status': 'scheduled',
    'percent_complete': 0.0,
    'progress': '',
}

At the same time I'm watching the jobs, waiting for the indexing job to come by, but I only see the graphrag index manager run every 5 minutes:

Every 2.0s: kubectl get jobs -...  SandboxHost-638599057829007509: Thu Aug 22 11:25:28 2024

NAME                              COMPLETIONS   DURATION   AGE
graphrag-index-manager-28738765   1/1           25s        28s

This is my parameters file:

{
  "GRAPHRAG_API_BASE": "https://aoai-graphrag-tst-francecentral.openai.azure.com",
  "GRAPHRAG_API_VERSION": "2024-02-15-preview",
  "GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME": "text-embedding-ada-002",
  "GRAPHRAG_EMBEDDING_MODEL": "text-embedding-ada-002",
  "GRAPHRAG_LLM_DEPLOYMENT_NAME": "gpt-4o",
  "GRAPHRAG_LLM_MODEL": "gpt-4o",
  "LOCATION": "francecentral",
  "RESOURCE_GROUP": "rg-graphrag-tst-04"
}

Not sure where to look now, as the indexing job does not start at all anymore. What region, LLM model version, API version, etc. should I use as a reference?
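
One thing I can still check (plain kubectl, using the index-manager job name from the watch output above) is what the index manager itself logs on its 5-minute runs, since that is where the decision to schedule or skip an indexing job is made:

# list the recent index-manager job runs
kubectl get jobs -n graphrag

# dump the logs of the most recent run
kubectl logs job/graphrag-index-manager-28738765 -n graphrag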

@rnpramasamyai commented:

@doruit Indexing will take time to complete.

doruit commented Aug 22, 2024

@rnpramasamyai, I've waited for an hour, but it seems it will not start, nor can I find any clue about where to look for errors. In the job log I only see this message every 5 minutes:

[screenshot of the index-manager log message]

What else can I check or rule out?

doruit commented Aug 23, 2024

@rnpramasamyai @timothymeyers
I switched from a CSP tenant to my MSDN tenant/subscription, did the full deployment, and it now seems to work:

HTTP/1.1 200 OK
content-length: 172
content-type: application/json
date: Fri, 23 Aug 2024 11:53:38 GMT
request-context: appId=cid-v1:xxxxxxxxxxxxxx
vary: Origin
    
{
    "status_code": 200,
    "index_name": "index1",
    "storage_name": "testdata",
    "status": "complete",
    "percent_complete": 100.0,
    "progress": "16 out of 16 workflows completed successfully."
}

I checked whether quota or Azure policy caused the issue in the CSP tenant/subscription; however, I could not find any logs, so I was not able to rule everything out.

There is only one policy that might impact the creation of VMs/VMSS. That policy requires VMs to have managed disks, which they all have, so I guess it won't block anything. Another policy blocks the creation of classic resources.

However, the good news is that with the alternative tenant/subscription the deployment was successful.

eai-douglaswross commented Aug 28, 2024

Firstly: thank you for this repo, and thanks for trying to help us punters understand what you have written.

I have the same issue with stopping at 2/16 workflows (12.5%).
I do not have an MSDN tenant; however, we do not have any policies specifically added to our tenant. It is very new, and out of the box.

The pod log command does not seem to work; I tried with both names while the job was running:

graphrag-solution-accelerator-py3.10vscode@docker-desktop:/graphrag-accelerator$ kubectl logs job/graphrag-index-manager-28746945 -n graphrag -f
Indexing job for 'indtestdata' already running. Will not schedule another. Exiting...
graphrag-solution-accelerator-py3.10vscode@docker-desktop:/graphrag-accelerator$ kubectl logs job/indtestdata -n graphrag -f
error: error from server (NotFound): jobs.batch "indtestdata" not found in namespace "graphrag"
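
(It looks like the actual indexing job gets its own generated name, different from both the index-manager job and the index name, so the way to find it -- plain kubectl, names below are placeholders -- is to list everything first:)

# list all jobs and pods in the namespace to find the real indexing job name
kubectl get jobs,pods -n graphrag

# then tail the logs of that job's pod
kubectl logs <indexing pod name> -n graphrag -f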

Can I suggest/request the following, as it may make everyone's job a little easier:

  1. Enable Azure AI Search access from the Portal when in a DEV deployment mode: can you set some variable in the deployment, like deployment_type=<dev/prod>, so that the blocking of Azure Portal access to the AI Search index is turned off in dev mode and locked down for a prod deployment?
  2. Add the option for a VM into deployment of the infra into the private network, so that we can use this method: https://learn.microsoft.com/en-gb/azure/search/service-create-private-endpoint#use-the-azure-portal-to-access-a-private-search-service
  3. Provide some instructions about manually putting a VM in the private network via the Azure portal, so we can remote into it and access the Azure Portal functionality as suggested in that link

In other words: it is very difficult to see what is going on and to understand what is going wrong.

Lastly, when you add a comment like:

@doruit, could you please add the api_key property under each LLM node in the following file: pipeline-settings.yaml?

For the rest of us trying to follow along, would you mind telling us quickly why you are suggesting that, so that we can also understand why it might fix the issue.

doruit commented Aug 28, 2024

I still don't know what caused the process to get stuck. It was not due to Azure policy or the api_key in pipeline-settings.yaml. Perhaps the model and API version are causing the issue. Other issue threads mention that if the embedding vector size is slightly different from what is expected, indexing will fail.

@timothymeyers, in deployment.md it looks like the API version is fixed to "2023-03-15-preview". Is that correct, or should the documentation instruct the developer to get the right API version for the deployed model (e.g. via the portal)?
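
One cheap sanity check for the vector-size theory (a sketch with the openai Python SDK; the key and deployment name are placeholders, the endpoint and API version are the ones from my parameters file) is to call the embedding deployment directly and confirm the vector length matches what the index expects, e.g. 1536 for text-embedding-ada-002:

from openai import AzureOpenAI

# Placeholders -- substitute your own key; endpoint/API version as configured above.
client = AzureOpenAI(
    azure_endpoint="https://aoai-graphrag-tst-francecentral.openai.azure.com",
    api_key="<azure-openai-key>",
    api_version="2024-02-15-preview",
)

resp = client.embeddings.create(model="text-embedding-ada-002", input="hello world")
print(len(resp.data[0].embedding))  # expect 1536 for text-embedding-ada-002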

@Daksh-S97 commented:

Hi, I'm facing a similar issue. I'm running it on a dev container. Which API version did you end up using?
