
Problem in multi-node training #7275

Apr 29, 2021 · 3 comments · 15 replies

Hello Nikos

Do you have 8 GPUs in the node? I think the GPU count must match the `--gres` setting.
Don't you also need to specify how many tasks per node in the SBATCH directives? [1]
Also, I notice some unsupported Trainer arguments in your script. It should be:

```python
trainer = Trainer(max_epochs=1, gpus=[0, 1, 2, 3, 4, 5, 6, 7], num_nodes=4)
```

Make sure this script actually runs on CPU first before going to the cluster 😅
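For reference, here's a minimal sketch of a submission script whose SBATCH directives match the Trainer settings above (4 nodes × 8 GPUs). The script name `train.py` and the time limit are placeholders; adjust them for your cluster:

```bash
#!/bin/bash
#SBATCH --nodes=4            # must match num_nodes=4 in the Trainer
#SBATCH --ntasks-per-node=8  # one task per GPU
#SBATCH --gres=gpu:8         # must match the number of GPUs requested per node
#SBATCH --time=01:00:00

# srun launches one process per task; Lightning reads the SLURM
# environment variables to wire up distributed training.
srun python train.py
```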

I'm totally not a SLURM expert here, just looking at your script with one eye closed.

[1] https://pytorch-lightning.readthedocs.io/en/latest/clouds/cluster.html#slurm-managed-cluster

Answer selected by awaelchli