The initialization of communication fails when the number of nodes is set to 3; the failure occurs in get_group. The relevant part of the failure is as follows (run on 3 nodes, each with 2 GPUs):
Traceback (most recent call last):
  File "pretrain_gpt.py", line 149, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step_func,
  File "/workspace/aceso/runtime/megatron/training.py", line 113, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/aceso/runtime/megatron/initialize.py", line 86, in initialize_megatron
    finish_mpu_init()
  File "/workspace/aceso/runtime/megatron/initialize.py", line 67, in finish_mpu_init
    _initialize_distributed()
  File "/workspace/aceso/runtime/megatron/initialize.py", line 209, in _initialize_distributed
    mpu.initialize_model_parallel_flexpipe()
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 288, in initialize_model_parallel_flexpipe
    get_group(ranks)
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 560, in get_group
    group_bits = bitmap(ranks)
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 555, in bitmap
    raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
ValueError: rank 6 out of range (6)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2618839) of binary: /usr/bin/python3
Traceback (most recent call last):
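From the traceback, the bitmap helper in megatron/mpu/initialize.py appears to build a per-rank bit vector and reject any rank index outside it. With 3 nodes × 2 GPUs the world size is 6, so valid ranks are 0..5, yet a generated group seems to reference rank 6. The snippet below is only a minimal sketch of that kind of range check (not the actual Aceso implementation) to illustrate how the error is triggered:

```python
# Minimal sketch (assumption: not the actual Aceso code) of a bitmap-style
# range check like the one raising the error in megatron/mpu/initialize.py.
def bitmap(ranks, world_size=6):
    """Set one bit per rank in a world_size-length bit list."""
    bits = [0] * world_size
    for rank in ranks:
        if rank >= len(bits):
            # Mirrors the failure in the traceback above:
            #   ValueError: rank 6 out of range (6)
            raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
        bits[rank] = 1
    return bits

# With 3 nodes x 2 GPUs the valid ranks are 0..5 (world size 6), so any
# process group built as if a fourth node existed (ranks 6 and 7) fails:
print(bitmap([0, 2, 4]))   # fine: [1, 0, 1, 0, 1, 0]
print(bitmap([2, 4, 6]))   # raises ValueError: rank 6 out of range (6)
```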
Thanks for your feedback! Aceso was not tested on clusters with an odd number of nodes, so some code may cause problems in such cluster settings. We will investigate this and update soon.
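Until the odd-node case is handled, a guard before group creation would at least turn the failure into a clearer message. The helper below is only a suggestion sketched against torch.distributed; neither the function nor this behaviour exists in the repository:

```python
import torch.distributed as dist

def validate_group_ranks(ranks):
    """Fail early with a descriptive message if a rank group references a
    rank outside the launched world size (e.g. rank 6 when 3 nodes x 2 GPUs
    give ranks 0..5). Hypothetical helper, not part of the Aceso codebase."""
    world_size = dist.get_world_size()
    bad = sorted(r for r in ranks if r < 0 or r >= world_size)
    if bad:
        raise ValueError(
            "group {} references ranks {} outside world size {}; check that "
            "the node count matches the parallelism configuration".format(
                ranks, bad, world_size))
```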