Array has been deleted #626
Comments
Hi David, can you please try 0.4.7? It might help.
Hi Niket, I tried upgrading to 0.4.7, and I'm still hitting the error. The only difference is that the error now crashes the training loop, whereas previously training would continue. If I set wait=True, the error disappears, but obviously that's not performant. Do I need to pass in [...]
I did a bit more debugging. I think the error occurs when aggregate=True in the save_args. Reading the code, it seems like in the aggregate=False branch, the [...] Note that aggregate=True is automatically set by [...]
Thanks for debugging the issue and tracing it to the SaveArgs.aggregate option! While we reproduce the issue in our dev setup, please switch to aggregate=False if that works for your use case. SaveArgs.aggregate is mainly meant as a performance optimization. I hope that unblocks you. To reproduce the issue, I will need your help:
Can you please share a code snippet or pointer that shows how you use Orbax, so that I can reproduce the issue and resolve it?
Hi, we are trying out the Orbax (0.4.1) AsyncCheckpointer (used through CheckpointManager). We are getting "Array has been deleted" errors. It seems as if the async checkpointer is trying to copy a jax.Array from device to host memory, but that array is no longer available. The Orbax documentation says that "From start to finish, async checkpointing for a train state of arrays works by first performing a blocking copy of the arrays from device to host", but I wonder if there are any gotchas in how we should use Orbax checkpointing.
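For readers following along, here is a minimal sketch of the failure mode described above. It uses plain JAX only, not Orbax, and it rests on an assumption: that the error appears when a device buffer is freed before the device-to-host copy reads it. The helper `snapshot_to_host` is a hypothetical name, and the explicit `.delete()` call only stands in for whatever frees the buffer in a real training loop (for example, buffer donation); it is not a description of Orbax internals.

```python
import jax.numpy as jnp
import numpy as np


def snapshot_to_host(tree):
    """Blocking device-to-host copy of every array in a flat dict (hypothetical helper)."""
    # np.array(...) forces a real host-side copy of each jax.Array.
    return {name: np.array(value) for name, value in tree.items()}


state = {"w": jnp.arange(4.0)}

# Copying while the device buffer is still alive works.
host_state = snapshot_to_host(state)

# Assumption: something frees the device buffer before a later copy runs.
# Here we delete it explicitly to make the effect deterministic.
state["w"].delete()

# The earlier host copy is unaffected by the deletion.
print(host_state["w"])

# A copy attempted after the buffer is gone raises a RuntimeError.
try:
    snapshot_to_host(state)
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(error_message)
```

If the copy has completed before the buffer is freed, the host-side data stays valid, which would be consistent with the error disappearing when the save is made blocking.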
Here is the stack trace: