Cannot save checkpoint file #230

Closed
racng opened this issue Jul 19, 2023 · 10 comments
@racng

racng commented Jul 19, 2023

I am trying out the branch sf_dev_0.3.0_postreg to use the --exclude-feature-types flag.
However, cellbender terminates with an error after the last epoch. It failed to save the checkpoint file throughout training, but this did not raise an error until the very end.
Here is the end of the log file:

cellbender:remove-background: [epoch 150]  average training loss: 8421.9586
cellbender:remove-background: [epoch 150] average test loss: 2003.4138
cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: 2023-07-17 15:37:22
cellbender:remove-background: Inference procedure complete.
Traceback (most recent call last):
  File "/users/rng/proj/single-cell-pipeline/conda/36d1c280b15c072d9ae0c93edf34f94e_/bin/cellbender", line 8, in <module>
    sys.exit(main())
  File "/users/rng/proj/single-cell-pipeline/conda/36d1c280b15c072d9ae0c93edf34f94e_/lib/python3.10/site-packages/cellbender/base_cli.py", line 123, in main
    cli_dict[args.tool].run(args)
  File "/users/rng/proj/single-cell-pipeline/conda/36d1c280b15c072d9ae0c93edf34f94e_/lib/python3.10/site-packages/cellbender/remove_background/cli.py", line 185, in run
    return main(args)
  File "/users/rng/proj/single-cell-pipeline/conda/36d1c280b15c072d9ae0c93edf34f94e_/lib/python3.10/site-packages/cellbender/remove_background/cli.py", line 230, in main
    posterior = run_remove_background(args)
  File "/users/rng/proj/single-cell-pipeline/conda/36d1c280b15c072d9ae0c93edf34f94e_/lib/python3.10/site-packages/cellbender/remove_background/run.py", line 98, in run_remove_background
    posterior = load_or_compute_posterior_and_save(
  File "/users/rng/proj/single-cell-pipeline/conda/36d1c280b15c072d9ae0c93edf34f94e_/lib/python3.10/site-packages/cellbender/remove_background/posterior.py", line 59, in load_or_compute_posterior_and_save
    assert os.path.exists(args.input_checkpoint_tarball), \
AssertionError: Checkpoint file ckpt.tar.gz does not exist, presumably because saving of the checkpoint file has been manually interrupted. load_or_compute_posterior_and_save() will not work properly without an existing checkpoint file. Please re-run and allow a checkpoint file to be saved.

I am using the following environment:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
anndata                   0.9.1                    pypi_0    pypi
anyio                     3.7.1                    pypi_0    pypi
argon2-cffi               21.3.0                   pypi_0    pypi
argon2-cffi-bindings      21.2.0                   pypi_0    pypi
arrow                     1.2.3                    pypi_0    pypi
asttokens                 2.2.1                    pypi_0    pypi
attrs                     23.1.0                   pypi_0    pypi
backcall                  0.2.0                    pypi_0    pypi
beautifulsoup4            4.12.2                   pypi_0    pypi
bleach                    6.0.0                    pypi_0    pypi
blosc                     1.21.4               h0f2a231_0    conda-forge
brotli-python             1.0.9           py310hd8f1fbe_9    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.19.1               hd590300_0    conda-forge
c-blosc2                  2.10.0               hb4ffafa_0    conda-forge
ca-certificates           2023.5.7             hbcca054_0    conda-forge
cellbender                0.3.0                    pypi_0    pypi
certifi                   2023.5.7           pyhd8ed1ab_0    conda-forge
cffi                      1.15.1                   pypi_0    pypi
charset-normalizer        3.2.0              pyhd8ed1ab_0    conda-forge
click                     8.1.5                    pypi_0    pypi
cmake                     3.26.4                   pypi_0    pypi
comm                      0.1.3                    pypi_0    pypi
contourpy                 1.1.0                    pypi_0    pypi
cuda-cudart               11.7.99                       0    nvidia
cuda-cupti                11.7.101                      0    nvidia
cuda-libraries            11.7.1                        0    nvidia
cuda-nvrtc                11.7.99                       0    nvidia
cuda-nvtx                 11.7.91                       0    nvidia
cuda-runtime              11.7.1                        0    nvidia
cuda-version              11.8                 h70ddcb2_2    conda-forge
cudatoolkit               11.8.0              h37601d7_11    conda-forge
cudnn                     8.8.0.121            h0800d71_1    conda-forge
cycler                    0.11.0                   pypi_0    pypi
debugpy                   1.6.7                    pypi_0    pypi
decorator                 5.1.1                    pypi_0    pypi
defusedxml                0.7.1                    pypi_0    pypi
exceptiongroup            1.1.2                    pypi_0    pypi
executing                 1.2.0                    pypi_0    pypi
fastjsonschema            2.17.1                   pypi_0    pypi
filelock                  3.12.2             pyhd8ed1ab_0    conda-forge
fonttools                 4.41.0                   pypi_0    pypi
fqdn                      1.5.1                    pypi_0    pypi
freetype                  2.12.1               hca18f0e_1    conda-forge
gmp                       6.2.1                h58526e2_0    conda-forge
gmpy2                     2.1.2           py310h3ec546c_1    conda-forge
h5py                      3.9.0                    pypi_0    pypi
hdf5                      1.14.1          nompi_h4f84152_100    conda-forge
icu                       72.1                 hcb278e6_0    conda-forge
idna                      3.4                pyhd8ed1ab_0    conda-forge
ipykernel                 6.24.0                   pypi_0    pypi
ipython                   8.14.0                   pypi_0    pypi
ipython-genutils          0.2.0                    pypi_0    pypi
ipywidgets                8.0.7                    pypi_0    pypi
isoduration               20.11.0                  pypi_0    pypi
jedi                      0.18.2                   pypi_0    pypi
jinja2                    3.1.2              pyhd8ed1ab_1    conda-forge
jsonpointer               2.4                      pypi_0    pypi
jsonschema                4.18.3                   pypi_0    pypi
jsonschema-specifications 2023.6.1                 pypi_0    pypi
jupyter                   1.0.0                    pypi_0    pypi
jupyter-client            8.3.0                    pypi_0    pypi
jupyter-console           6.6.3                    pypi_0    pypi
jupyter-contrib-core      0.4.2                    pypi_0    pypi
jupyter-contrib-nbextensions 0.7.0                    pypi_0    pypi
jupyter-core              5.3.1                    pypi_0    pypi
jupyter-events            0.6.3                    pypi_0    pypi
jupyter-highlight-selected-word 0.2.0                    pypi_0    pypi
jupyter-nbextensions-configurator 0.6.3                    pypi_0    pypi
jupyter-server            2.7.0                    pypi_0    pypi
jupyter-server-terminals  0.4.4                    pypi_0    pypi
jupyterlab-pygments       0.2.2                    pypi_0    pypi
jupyterlab-widgets        3.0.8                    pypi_0    pypi
keyutils                  1.6.1                h166bdaf_0    conda-forge
kiwisolver                1.4.4                    pypi_0    pypi
krb5                      1.21.1               h659d440_0    conda-forge
lcms2                     2.15                 haa2dc70_1    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
lerc                      4.0.0                h27087fc_0    conda-forge
libaec                    1.0.6                hcb278e6_1    conda-forge
libblas                   3.9.0           17_linux64_openblas    conda-forge
libcblas                  3.9.0           17_linux64_openblas    conda-forge
libcublas                 11.10.3.66                    0    nvidia
libcufft                  10.7.2.124           h4fbf590_0    nvidia
libcufile                 1.7.0.149                     0    nvidia
libcurand                 10.3.3.53                     0    nvidia
libcurl                   8.1.2                hca28451_1    conda-forge
libcusolver               11.4.0.1                      0    nvidia
libcusparse               11.7.4.91                     0    nvidia
libdeflate                1.18                 h0b41bf4_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.1.0               he5830b7_0    conda-forge
libgfortran-ng            13.1.0               h69a702a_0    conda-forge
libgfortran5              13.1.0               h15d22d2_0    conda-forge
libhwloc                  2.9.1           nocuda_h7313eea_6    conda-forge
libiconv                  1.17                 h166bdaf_0    conda-forge
libjpeg-turbo             2.1.5.1              h0b41bf4_0    conda-forge
liblapack                 3.9.0           17_linux64_openblas    conda-forge
libmagma                  2.7.1                hc72dce7_3    conda-forge
libmagma_sparse           2.7.1                hc72dce7_4    conda-forge
libnghttp2                1.52.0               h61bc06f_0    conda-forge
libnpp                    11.7.4.75                     0    nvidia
libnsl                    2.0.0                h7f98852_0    conda-forge
libnvjpeg                 11.8.0.2                      0    nvidia
libopenblas               0.3.23          pthreads_h80387f5_0    conda-forge
libpng                    1.6.39               h753d276_0    conda-forge
libprotobuf               3.21.12              h3eb15da_0    conda-forge
libsqlite                 3.42.0               h2797004_0    conda-forge
libssh2                   1.11.0               h0841786_0    conda-forge
libstdcxx-ng              13.1.0               hfd8a6a1_0    conda-forge
libtiff                   4.5.1                h8b53f26_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libwebp-base              1.3.1                hd590300_0    conda-forge
libxcb                    1.15                 h0b41bf4_0    conda-forge
libxml2                   2.11.4               h0d562d8_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
lit                       16.0.6                   pypi_0    pypi
llvm-openmp               16.0.6               h4dfa4b3_0    conda-forge
llvmlite                  0.40.1                   pypi_0    pypi
loompy                    3.0.7                    pypi_0    pypi
lxml                      4.9.3                    pypi_0    pypi
lz4-c                     1.9.4                hcb278e6_0    conda-forge
lzo                       2.10              h516909a_1000    conda-forge
magma                     2.7.1                ha770c72_4    conda-forge
markupsafe                2.1.3           py310h2372a71_0    conda-forge
matplotlib                3.7.2                    pypi_0    pypi
matplotlib-inline         0.1.6                    pypi_0    pypi
mistune                   3.0.1                    pypi_0    pypi
mkl                       2022.2.1         h84fe81f_16997    conda-forge
mpc                       1.3.1                hfe3b2da_0    conda-forge
mpfr                      4.2.0                hb012696_0    conda-forge
mpmath                    1.3.0              pyhd8ed1ab_0    conda-forge
natsort                   8.4.0                    pypi_0    pypi
nbclassic                 1.0.0                    pypi_0    pypi
nbclient                  0.8.0                    pypi_0    pypi
nbconvert                 7.7.1                    pypi_0    pypi
nbformat                  5.9.1                    pypi_0    pypi
nccl                      2.18.3.1             h12f7317_0    conda-forge
ncurses                   6.4                  hcb278e6_0    conda-forge
nest-asyncio              1.5.6                    pypi_0    pypi
networkx                  3.1                pyhd8ed1ab_0    conda-forge
notebook                  6.5.4                    pypi_0    pypi
notebook-shim             0.2.3                    pypi_0    pypi
numba                     0.57.1                   pypi_0    pypi
numexpr                   2.7.3           py310hb5077e9_1    conda-forge
numpy                     1.24.4                   pypi_0    pypi
numpy-groupies            0.9.22                   pypi_0    pypi
nvidia-cublas-cu11        11.10.3.66               pypi_0    pypi
nvidia-cuda-cupti-cu11    11.7.101                 pypi_0    pypi
nvidia-cuda-nvrtc-cu11    11.7.99                  pypi_0    pypi
nvidia-cuda-runtime-cu11  11.7.99                  pypi_0    pypi
nvidia-cudnn-cu11         8.5.0.96                 pypi_0    pypi
nvidia-cufft-cu11         10.9.0.58                pypi_0    pypi
nvidia-curand-cu11        10.2.10.91               pypi_0    pypi
nvidia-cusolver-cu11      11.4.0.1                 pypi_0    pypi
nvidia-cusparse-cu11      11.7.4.91                pypi_0    pypi
nvidia-nccl-cu11          2.14.3                   pypi_0    pypi
nvidia-nvtx-cu11          11.7.91                  pypi_0    pypi
openjpeg                  2.5.0                hfec8fc6_2    conda-forge
openssl                   3.1.1                hd590300_1    conda-forge
opt-einsum                3.3.0                    pypi_0    pypi
overrides                 7.3.1                    pypi_0    pypi
packaging                 23.1               pyhd8ed1ab_0    conda-forge
pandas                    2.0.3                    pypi_0    pypi
pandocfilters             1.5.0                    pypi_0    pypi
parso                     0.8.3                    pypi_0    pypi
pexpect                   4.8.0                    pypi_0    pypi
pickleshare               0.7.5                    pypi_0    pypi
pillow                    10.0.0          py310h582fbeb_0    conda-forge
pip                       23.2               pyhd8ed1ab_0    conda-forge
platformdirs              3.9.1                    pypi_0    pypi
prometheus-client         0.17.1                   pypi_0    pypi
prompt-toolkit            3.0.39                   pypi_0    pypi
psutil                    5.9.5                    pypi_0    pypi
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
ptyprocess                0.7.0                    pypi_0    pypi
pure-eval                 0.2.2                    pypi_0    pypi
py-cpuinfo                9.0.0              pyhd8ed1ab_0    conda-forge
pycparser                 2.21                     pypi_0    pypi
pygments                  2.15.1                   pypi_0    pypi
pyparsing                 3.0.9                    pypi_0    pypi
pyro-api                  0.1.2                    pypi_0    pypi
pyro-ppl                  1.8.5                    pypi_0    pypi
pysocks                   1.7.1              pyha2e5f31_6    conda-forge
pytables                  3.8.0           py310ha028ce3_2    conda-forge
python                    3.10.12         hd12c33a_0_cpython    conda-forge
python-dateutil           2.8.2                    pypi_0    pypi
python-json-logger        2.0.7                    pypi_0    pypi
python_abi                3.10                    3_cp310    conda-forge
pytorch-cuda              11.7                 h778d358_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2023.3                   pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
pyzmq                     25.1.0                   pypi_0    pypi
qtconsole                 5.4.3                    pypi_0    pypi
qtpy                      2.3.1                    pypi_0    pypi
readline                  8.2                  h8228510_1    conda-forge
referencing               0.29.1                   pypi_0    pypi
requests                  2.31.0             pyhd8ed1ab_0    conda-forge
rfc3339-validator         0.1.4                    pypi_0    pypi
rfc3986-validator         0.1.1                    pypi_0    pypi
rpds-py                   0.8.11                   pypi_0    pypi
scipy                     1.11.1                   pypi_0    pypi
send2trash                1.8.2                    pypi_0    pypi
setuptools                68.0.0             pyhd8ed1ab_0    conda-forge
six                       1.16.0                   pypi_0    pypi
sleef                     3.5.1                h9b69904_2    conda-forge
snappy                    1.1.10               h9fff704_0    conda-forge
sniffio                   1.3.0                    pypi_0    pypi
soupsieve                 2.4.1                    pypi_0    pypi
stack-data                0.6.2                    pypi_0    pypi
sympy                     1.12            pypyh9d50eac_103    conda-forge
tbb                       2021.9.0             hf52228f_0    conda-forge
terminado                 0.17.1                   pypi_0    pypi
tinycss2                  1.2.1                    pypi_0    pypi
tk                        8.6.12               h27826a3_0    conda-forge
torch                     2.0.1                    pypi_0    pypi
torchaudio                2.0.0               py310_cu117    pytorch
torchvision               0.15.2          cuda112py310h0801bf5_1    conda-forge
tornado                   6.3.2                    pypi_0    pypi
tqdm                      4.65.0                   pypi_0    pypi
traitlets                 5.9.0                    pypi_0    pypi
triton                    2.0.0                    pypi_0    pypi
typing_extensions         4.7.1              pyha770c72_0    conda-forge
tzdata                    2023.3                   pypi_0    pypi
uri-template              1.3.0                    pypi_0    pypi
urllib3                   2.0.3              pyhd8ed1ab_1    conda-forge
wcwidth                   0.2.6                    pypi_0    pypi
webcolors                 1.13                     pypi_0    pypi
webencodings              0.5.1                    pypi_0    pypi
websocket-client          1.6.1                    pypi_0    pypi
wheel                     0.40.0             pyhd8ed1ab_1    conda-forge
widgetsnbextension        4.0.8                    pypi_0    pypi
xorg-libxau               1.0.11               hd590300_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zlib-ng                   2.0.7                h0b41bf4_0    conda-forge
zstd                      1.5.2                hfc55251_7    conda-forge
@sjfleming
Member

There have been recent changes to the sf_dev_0.3.0_postreg branch. Can you try updating and run it again?

It should now print an error message when the checkpoint file fails to save. Hopefully you'll be able to see why.

The behavior is to save the checkpoint file in the folder from which you run the cellbender command. So you will need permission to save a file in that directory. That could be why the saving is failing. But hopefully you'll see for sure if you re-run with the updated version and take a look at the error message.
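To rule out the permissions explanation before starting a long run, a quick check like the one below may help. This is only a sketch: `can_write_checkpoint` is a hypothetical helper, not part of cellbender, and it only tests whether a file can be created in a directory, which is a necessary condition for saving a checkpoint there.

```python
import os
import tempfile


def can_write_checkpoint(directory="."):
    """Return True if a file can be created in `directory`.

    Hypothetical helper (not part of cellbender): it creates and removes
    a temporary file in the directory the command would be run from.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        # Covers missing directories and permission errors.
        return False


if __name__ == "__main__":
    print(os.getcwd(), "writable:", can_write_checkpoint())
```

Run it from the same directory in which you launch the cellbender command. If it reports that the directory is not writable, the checkpoint cannot be saved there either.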

@sjfleming sjfleming self-assigned this Aug 7, 2023
@chris-rands

chris-rands commented Aug 10, 2023

Hi @sjfleming, first congrats on the new release and publication! I am testing 0.3 and I get the AssertionError as above on multiple samples. As you indicate, I am getting a more detailed traceback:

cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/path/lib/python3.9/site-packages/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/path/lib/python3.9/site-packages/torch/serialization.py", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/path/lib/python3.9/site-packages/torch/serialization.py", line 653, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'weakref' object

Does this help diagnose the issue or do you have any suggested workarounds?
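For readers unfamiliar with this error class: the same kind of TypeError can be reproduced with the standard library alone, since `pickle` refuses to serialize weak references. The sketch below is not the cellbender model; the dictionary is a stand-in, assumed for illustration, for any object that holds a weak reference somewhere in its state.

```python
import pickle
import weakref


class Target:
    """Any ordinary object that can be weakly referenced."""


def try_pickle(obj):
    """Return None if `obj` pickles cleanly, else the TypeError message."""
    try:
        pickle.dumps(obj)
        return None
    except TypeError as exc:
        return str(exc)


target = Target()
# Stand-in for a model whose state holds a weak reference somewhere.
state_with_weakref = {"ref": weakref.ref(target)}

print(try_pickle({"weights": [1.0, 2.0]}))  # plain data pickles fine
print(try_pickle(state_with_weakref))       # message mentions 'weakref'
```

The exact wording of the message varies between Python versions, but it names the weakref type in each case. This only illustrates the failure mode; it does not show which part of the saved model contains the weak reference.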

EDIT: sorry I missed #212 - do you think I can solve this by specifying a different torch version? (currently torch.__version__ is '2.0.1+cu117' and I'm using Python 3.9 obviously)

EDIT 2: using Python 3.7 with torch==1.13.1 seems to be a work-around

@sjfleming
Member

Oh yes! You are exactly right. This tool currently requires python 3.7, and python 3.7 limits the pytorch version to < 2.
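As a rough preflight check, the combination described here (Python 3.7 with PyTorch older than 2) can be tested before launching a run. This is a sketch based only on what is reported in this thread; `torch_major` and `matches_reported_workaround` are hypothetical helper names, not cellbender functions.

```python
def torch_major(version):
    """Extract the major version from a string such as '2.0.1+cu117'."""
    return int(version.split(".")[0])


def matches_reported_workaround(python_version, torch_version):
    """True if Python is 3.7 and PyTorch is older than 2.

    `python_version` is a tuple like (3, 7, 12) or sys.version_info;
    `torch_version` is a string like torch.__version__.
    """
    return tuple(python_version[:2]) == (3, 7) and torch_major(torch_version) < 2


# Values taken from the environments posted in this thread.
print(matches_reported_workaround((3, 9), "2.0.1+cu117"))  # False
print(matches_reported_workaround((3, 7), "1.13.1"))       # True
```

In a real environment you could call `matches_reported_workaround(sys.version_info, torch.__version__)` after importing `sys` and `torch`.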

It is precisely this annoying problem that I was struggling to fix when going to pytorch 2+ #203

@sjfleming
Member

Actually, now that I look back at #203, it might be time to try again: most of those underlying issues have been solved by the pyro team. I do eventually want people to be able to use pytorch 2+

@gvogler

gvogler commented Aug 11, 2023

I am having the exact same issue in Google Colab which uses Python 3.10.12. Reverting to 3.7 seems to be excessively difficult in that environment.

@sjfleming
Member

@gvogler yeah I really want people to be able to use Google Colab. That was one of the advantages of checkpointing (I was hoping): if you get booted, but you save the checkpoint file somewhere stable (can't you sync with google drive somehow?) then training will pick back up where it left off.
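One way to keep the checkpoint "somewhere stable" is to copy it out of the working directory after it is written and copy it back before restarting. The sketch below assumes the checkpoint is named ckpt.tar.gz (as in the error message above) and that the stable location is an ordinary directory path, for example a mounted drive. Both helper functions are hypothetical, and how cellbender is pointed at the restored file is not covered here.

```python
import shutil
from pathlib import Path

CHECKPOINT_NAME = "ckpt.tar.gz"  # name taken from the error message in this thread


def backup_checkpoint(run_dir, stable_dir):
    """Copy the checkpoint from run_dir to stable_dir.

    Returns the path of the copy, or None if there is no checkpoint yet.
    """
    src = Path(run_dir) / CHECKPOINT_NAME
    if not src.exists():
        return None
    Path(stable_dir).mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, Path(stable_dir) / CHECKPOINT_NAME))


def restore_checkpoint(stable_dir, run_dir):
    """Copy a saved checkpoint back into run_dir if run_dir has none.

    Returns the restored path, or None if nothing was copied.
    """
    src = Path(stable_dir) / CHECKPOINT_NAME
    dst = Path(run_dir) / CHECKPOINT_NAME
    if src.exists() and not dst.exists():
        shutil.copy2(src, dst)
        return dst
    return None
```

Calling `restore_checkpoint` at the start of a session and `backup_checkpoint` after training (or periodically) would cover the case of being disconnected mid-run.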

Didn't realize it was hard to run python 3.7 in colab. I need to experiment a bit there.

Do you know if it's feasible to run something in a docker image on google colab?

@sjfleming
Member

Alright @gvogler , I have made an attempt to have a working demo on Google Colab. I really do want people to be able to use Colab! Here it is:

https://gist.github.com/sjfleming/a2ec2bc1d1fcc3ff75600cec2395636c

@gvogler

gvogler commented Aug 14, 2023

I can confirm it is processing the data. Thanks for finding this workaround!

@racng
Author

racng commented Aug 22, 2023

Hi, I no longer have problems saving the checkpoint file when using the latest released cellbender v0.3.0. Thank you for fixing the issue! I followed your advice and downgraded my Python to 3.7. Feel free to close this issue!

Here are the packages I used:

python                    3.7.12          hf930737_100_cpython    conda-forge
pytorch                   1.12.1          cuda112py37h6a44366_201    conda-forge
pytorch-cuda              11.7                 h778d358_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
torchaudio                0.12.1               py37_cu116    pytorch
torchvision               0.13.0          cuda112py37h67e586c_0    conda-forge
cellbender                0.3.0                    pypi_0    pypi
pytables                  3.7.0            py37h7d129aa_2    conda-forge

@sjfleming
Member

Great news @racng !

I will close this issue and follow up on the efforts toward porting the project to python > 3.7 in #203


4 participants