Cannot save checkpoint file #230
Comments
There have been recent changes to the It should now print an error message when the checkpoint file fails to save. Hopefully you'll be able to see why. The behavior is to save the checkpoint file in the folder from which you run the cellbender command. So you will need permission to save a file in that directory. That could be why the saving is failing. But hopefully you'll see for sure if you re-run with the updated version and take a look at the error message.
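A quick way to rule out the permissions explanation is to try creating a file in the directory you plan to launch from. The following is a minimal sketch, not part of cellbender; the helper name `can_write_here` is made up for illustration.

```python
import os
import tempfile


def can_write_here(directory="."):
    """Return True if a file can actually be created in `directory`.

    Hypothetical helper for illustration; it is not part of cellbender.
    """
    try:
        # Creating (and automatically deleting) a temporary file is a more
        # reliable test than inspecting permission bits.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        return False


if __name__ == "__main__":
    cwd = os.getcwd()
    print(cwd, "writable:", can_write_here(cwd))
```

If this reports that the directory is not writable, running the command from a directory you own is worth trying before digging further.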
Hi @sjfleming, first congrats on the new release and publication! I am testing 0.3 and I get the
Does this help diagnose the issue or do you have any suggested workarounds? EDIT: sorry I missed #212 - do you think I can save this via specifying a different torch version? (currently EDIT 2: using Python 3.7 with
Oh yes! You are exactly right. This tool currently requires Python 3.7, and Python 3.7 limits the PyTorch version to < 2. It is precisely this annoying problem that I was struggling to fix when moving to PyTorch 2+ in #203
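To see whether an environment matches the combination described here (Python 3.7 with PyTorch below 2), a small check like the one below may help. It is only a sketch: the function name `matches_thread_setup` is hypothetical, and the version rule is taken from this thread rather than from any cellbender API.

```python
import sys


def matches_thread_setup(python_version, torch_version):
    """Return True for Python 3.7 combined with a PyTorch major version below 2.

    `python_version` is a tuple such as sys.version_info;
    `torch_version` is a string such as torch.__version__.
    Hypothetical helper; the rule comes from this thread only.
    """
    py_ok = tuple(python_version[:2]) == (3, 7)
    # Versions may look like "1.13.1+cu117"; only the major number matters.
    torch_major = int(torch_version.split(".")[0])
    return py_ok and torch_major < 2


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        ok = matches_thread_setup(sys.version_info, torch.__version__)
        print("Python", sys.version.split()[0], "/ torch", torch.__version__)
        print("matches the Python 3.7 + torch < 2 setup:", ok)
```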
Actually, now that I look back at #203, it might be time to try again: most of those underlying issues have been solved by the pyro team. I do eventually want people to be able to use PyTorch 2+
I am having the exact same issue in Google Colab, which uses Python 3.10.12. Reverting to 3.7 seems to be excessively difficult in that environment.
@gvogler yeah I really want people to be able to use Google Colab. That was one of the advantages of checkpointing (I was hoping): if you get booted, but you save the checkpoint file somewhere stable (can't you sync with Google Drive somehow?) then training will pick back up where it left off. Didn't realize it was hard to run Python 3.7 in Colab. I need to experiment a bit there. Do you know if it's feasible to run something in a docker image on Google Colab?
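The idea of keeping the checkpoint somewhere stable could be sketched as below: copy the checkpoint file into a directory that outlives the session. Everything named here is an assumption for illustration: the helper `backup_checkpoint`, the file name `ckpt.tar.gz`, and the mounted Drive path are not confirmed anywhere in this thread.

```python
import shutil
from pathlib import Path


def backup_checkpoint(checkpoint, stable_dir):
    """Copy `checkpoint` into `stable_dir`; return the new path, or None if absent.

    Hypothetical helper for illustration; it is not part of cellbender.
    """
    src = Path(checkpoint)
    if not src.is_file():
        # Nothing to back up yet (for example, no checkpoint has been written).
        return None
    dest_dir = Path(stable_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves timestamps, which helps tell which copy is newer.
    return Path(shutil.copy2(src, dest_dir / src.name))


if __name__ == "__main__":
    # Both paths are assumptions: adjust them to your own checkpoint file
    # and to wherever your persistent storage is mounted.
    saved = backup_checkpoint("ckpt.tar.gz", "/content/drive/MyDrive/cellbender_ckpt")
    print("backed up to:", saved)
```

To resume after a disconnect, the copy would be moved back into the directory the command is run from, since that is where the checkpoint is saved.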
Alright @gvogler, I have made an attempt to have a working demo on Google Colab. I really do want people to be able to use Colab! Here it is: https://gist.github.com/sjfleming/a2ec2bc1d1fcc3ff75600cec2395636c
I can confirm it is processing the data. Thanks for finding this workaround!
Hi, I no longer have problems saving the checkpoint file when using the latest released cellbender v0.3.0. Thank you for fixing the issue! I followed your advice and downgraded my Python to 3.7. Feel free to close this issue! Here are the packages I used:
I am trying out the branch sf_dev_0.3.0_postreg to use the --exclude-feature-types flag. However, cellbender terminates after the last epoch. It also could not save the checkpoint file at any point during training, but this didn't cause an error until the very end.
Here is the end of the log file.
I am using the following environment: