Cannot save checkpoint file #230

racng opened this issue Jul 19, 2023 · 10 comments


racng commented Jul 19, 2023

I am trying out the branch sf_dev_0.3.0_postreg to use the --exclude-feature-types flag.
However, cellbender terminates after the last epoch. It also could not save the checkpoint file at any point during training, but this didn't cause an error until the very end.
Here is the end of the log file.

cellbender:remove-background: [epoch 150]  average training loss: 8421.9586
cellbender:remove-background: [epoch 150] average test loss: 2003.4138
cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: 2023-07-17 15:37:22
cellbender:remove-background: Inference procedure complete.
Traceback (most recent call last):
  File "/users/rng/proj/single-cell-pipeline/conda/36d1c280b15c072d9ae0c93edf34f94e_/bin/cellbender", line 8, in <module>
  File "/users/rng/proj/single-cell-pipeline/conda/36d1c280b15c072d9ae0c93edf34f94e_/lib/python3.10/site-packages/cellbender/", line 123, in main
  File "/users/rng/proj/single-cell-pipeline/conda/36d1c280b15c072d9ae0c93edf34f94e_/lib/python3.10/site-packages/cellbender/remove_background/", line 185, in run
    return main(args)
  File "/users/rng/proj/single-cell-pipeline/conda/36d1c280b15c072d9ae0c93edf34f94e_/lib/python3.10/site-packages/cellbender/remove_background/", line 230, in main
    posterior = run_remove_background(args)
  File "/users/rng/proj/single-cell-pipeline/conda/36d1c280b15c072d9ae0c93edf34f94e_/lib/python3.10/site-packages/cellbender/remove_background/", line 98, in run_remove_background
    posterior = load_or_compute_posterior_and_save(
  File "/users/rng/proj/single-cell-pipeline/conda/36d1c280b15c072d9ae0c93edf34f94e_/lib/python3.10/site-packages/cellbender/remove_background/", line 59, in load_or_compute_posterior_and_save
    assert os.path.exists(args.input_checkpoint_tarball), \
AssertionError: Checkpoint file ckpt.tar.gz does not exist, presumably because saving of the checkpoint file has been manually interrupted. load_or_compute_posterior_and_save() will not work properly without an existing checkpoint file. Please re-run and allow a checkpoint file to be saved.

I am using the following environment:

There have been recent changes to the sf_dev_0.3.0_postreg branch. Can you try updating and run it again?

It should now print an error message when the checkpoint file fails to save. Hopefully you'll be able to see why.

The behavior is to save the checkpoint file in the folder from which you run the cellbender command. So you will need permission to save a file in that directory. That could be why the saving is failing. But hopefully you'll see for sure if you re-run with the updated version and take a look at the error message.
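If a permissions problem is suspected, a quick check like the one below can rule it in or out before starting a long run. This is only a sketch: `can_write_here` is a hypothetical helper name, not part of cellbender, and it simply tries to create a temporary file in the directory you plan to run from.

```python
import os
import tempfile


def can_write_here(directory: str = ".") -> bool:
    """Return True if a file can actually be created in `directory`.

    Hypothetical helper, not part of cellbender. Creating a real temporary
    file is used instead of os.access(), which can be misleading on some
    network filesystems.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            pass  # the file is deleted again when the block exits
        return True
    except OSError:
        # Covers a missing directory as well as missing permissions.
        return False


print("working directory writable:", can_write_here(os.getcwd()))
```

If this prints `False` for the directory where the cellbender command is launched, the checkpoint cannot be written there either.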

@sjfleming sjfleming self-assigned this Aug 7, 2023
chris-rands commented Aug 10, 2023

Hi @sjfleming, first congrats on the new release and publication! I am testing 0.3 and I get the AssertionError as above on multiple samples. As you indicate, I am getting a more detailed traceback:

cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/path/lib/python3.9/site-packages/cellbender/remove_background/", line 115, in save_checkpoint, filebase + '_model.torch')
  File "/path/lib/python3.9/site-packages/torch/", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/path/lib/python3.9/site-packages/torch/", line 653, in _save
TypeError: cannot pickle 'weakref' object

Does this help diagnose the issue or do you have any suggested workarounds?

EDIT: sorry I missed #212 - do you think I can solve this by specifying a different torch version? (currently torch.__version__ is '2.0.1+cu117' and I'm using Python 3.9, obviously)

EDIT 2: using Python 3.7 with torch==1.13.1 seems to be a work-around
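For anyone wondering what the `TypeError` above means in general: Python's `pickle` refuses to serialize weak references, so any object graph that contains one somewhere will fail to save. The snippet below is a minimal sketch of that behavior using only the standard library. It does not show where the weak reference sits in the cellbender model, and `Target` and `pickles_cleanly` are made-up names for illustration.

```python
import pickle
import weakref


class Target:
    """Stand-in object; it is only ever referenced weakly, never pickled."""


def pickles_cleanly(obj) -> bool:
    """Return True if pickle can serialize `obj`, False if it refuses."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


target = Target()

# A plain state dict serializes without trouble.
state_plain = {"weights": [1.0, 2.0]}

# The same kind of dict holding a weak reference cannot be pickled.
state_with_ref = {"weights": [1.0, 2.0], "owner": weakref.ref(target)}

print("plain state pickles:", pickles_cleanly(state_plain))
print("state with weakref pickles:", pickles_cleanly(state_with_ref))
```

The exact wording of the error message differs between Python versions, but the exception type is `TypeError` in each case.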

Oh yes! You are exactly right. This tool currently requires python 3.7, and python 3.7 limits the pytorch version to < 2.

It is precisely this annoying problem that I was struggling to fix when going to pytorch 2+ #203

Actually, now that I look back at #203, it might be time to try again, since most of those underlying issues have been solved by the pyro team. I do eventually want people to be able to use pytorch 2+.
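Until then, a small pre-flight check can catch an environment that falls outside the combination reported to work in this thread (Python 3.7 with a PyTorch 1.x release). The following is a rough sketch; `matches_reported_workaround` is a hypothetical helper, not something cellbender provides, and the rule it encodes is only what was reported here.

```python
def parse_major(version: str) -> int:
    """Extract the major version from strings such as '2.0.1+cu117'."""
    return int(version.split(".")[0])


def matches_reported_workaround(python_version, torch_version: str) -> bool:
    """True for Python 3.7 together with PyTorch < 2 (hypothetical helper).

    `python_version` can be sys.version_info or any (major, minor, ...) tuple.
    """
    return tuple(python_version[:2]) == (3, 7) and parse_major(torch_version) < 2


# The two environments mentioned earlier in this thread:
print(matches_reported_workaround((3, 7), "1.13.1"))
print(matches_reported_workaround((3, 9), "2.0.1+cu117"))
```

In a real environment you would pass `sys.version_info` and `torch.__version__` instead of the literal values.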

gvogler commented Aug 11, 2023

I am having the exact same issue in Google Colab which uses Python 3.10.12. Reverting to 3.7 seems to be excessively difficult in that environment.

@gvogler yeah I really want people to be able to use Google Colab. That was one of the advantages of checkpointing (I was hoping): if you get booted, but you save the checkpoint file somewhere stable (can't you sync with google drive somehow?) then training will pick back up where it left off.
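As a rough sketch of that idea: copy the checkpoint to a stable folder after a run, and copy it back into the working directory before restarting. The helper names are hypothetical, the `ckpt.tar.gz` filename is taken from the error message earlier in this thread, and the stable folder could be a mounted Google Drive directory (an assumption; here a temporary directory stands in for it).

```python
import shutil
import tempfile
from pathlib import Path


def back_up_checkpoint(checkpoint, stable_dir):
    """Copy `checkpoint` into `stable_dir`; return the new path, or None if absent."""
    src = Path(checkpoint)
    if not src.is_file():
        return None
    dest_dir = Path(stable_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dest_dir / src.name))


def restore_checkpoint(stable_dir, workdir, name="ckpt.tar.gz"):
    """Copy a backed-up checkpoint into `workdir`; return its path, or None if absent."""
    src = Path(stable_dir) / name
    if not src.is_file():
        return None
    return Path(shutil.copy2(src, Path(workdir) / name))


# Demo with temporary directories standing in for the Colab working
# directory and the stable storage location.
work = Path(tempfile.mkdtemp())
stable = Path(tempfile.mkdtemp()) / "cellbender_backup"
(work / "ckpt.tar.gz").write_bytes(b"demo checkpoint")

saved = back_up_checkpoint(work / "ckpt.tar.gz", stable)
fresh_session = Path(tempfile.mkdtemp())
restored = restore_checkpoint(stable, fresh_session)
print("backed up to:", saved)
print("restored to:", restored)
```

After restoring, the cellbender command would be re-run from the directory that now contains the checkpoint file.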

Didn't realize it was hard to run python 3.7 in colab. I need to experiment a bit there.

Do you know if it's feasible to run something in a docker image on google colab?

Alright @gvogler , I have made an attempt to have a working demo on Google Colab. I really do want people to be able to use Colab! Here it is:

gvogler commented Aug 14, 2023

I can confirm it is processing the data. Thanks for finding this workaround!

racng commented Aug 22, 2023

Hi, I no longer have problems saving the checkpoint file when using the latest released cellbender v0.3.0. Thank you for fixing the issue! I followed your advice to downgrade my python to 3.7. Feel free to close this issue!

Here are the packages I used:

python                    3.7.12          hf930737_100_cpython       conda-forge
pytorch                   1.12.1          cuda112py37h6a44366_201    conda-forge
pytorch-cuda              11.7            h778d358_5                 pytorch
pytorch-mutex             1.0             cuda                       pytorch
torchaudio                0.12.1          py37_cu116                 pytorch
torchvision               0.13.0          cuda112py37h67e586c_0      conda-forge
cellbender                0.3.0           pypi_0                     pypi
pytables                  3.7.0           py37h7d129aa_2             conda-forge

Great news @racng !

I will close this issue and follow up on the efforts toward porting the project to python > 3.7 in #203

