This repository has been archived by the owner on Feb 6, 2025. It is now read-only.

Fail to initiation of communication #22

Open
YuMJie opened this issue Nov 9, 2024 · 1 comment

YuMJie commented Nov 9, 2024

Communication initialization fails when the number of nodes is set to 3.

The failure occurs in get_group.

The relevant part of the traceback follows (run on 3 nodes with 2 GPUs each):

Traceback (most recent call last):
  File "pretrain_gpt.py", line 149, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step_func,
  File "/workspace/aceso/runtime/megatron/training.py", line 113, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/aceso/runtime/megatron/initialize.py", line 86, in initialize_megatron
    finish_mpu_init()
  File "/workspace/aceso/runtime/megatron/initialize.py", line 67, in finish_mpu_init
    _initialize_distributed()
  File "/workspace/aceso/runtime/megatron/initialize.py", line 209, in _initialize_distributed
    mpu.initialize_model_parallel_flexpipe()
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 288, in initialize_model_parallel_flexpipe
    get_group(ranks)    
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 560, in get_group
    group_bits = bitmap(ranks)
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 555, in bitmap
    raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
ValueError: rank 6 out of range (6)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2618839) of binary: /usr/bin/python3
Traceback (most recent call last):
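
To make the error message concrete, here is a minimal sketch of a range check that would produce it. Only the `raise` line and the names `bitmap`, `bits`, and `rank` come from the traceback. The `world_size` parameter, the list-of-booleans representation, and the example rank lists are assumptions for illustration, not Aceso's actual implementation. With 3 nodes and 2 GPUs each there are 6 processes, so valid ranks are 0 through 5, and any group that names rank 6 fails the check.

```python
def bitmap(ranks, world_size):
    """Hypothetical reconstruction: one flag per rank in the job."""
    bits = [False] * world_size
    for rank in ranks:
        # Same check and message as the line shown in the traceback.
        if rank >= len(bits):
            raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
        bits[rank] = True
    return bits


world_size = 3 * 2  # 3 nodes x 2 GPUs, so ranks 0..5 are valid

print(bitmap([0, 1], world_size))  # a group that fits in the job

try:
    bitmap([4, 5, 6], world_size)  # hypothetical rank list containing rank 6
except ValueError as err:
    print(err)  # rank 6 out of range (6)
```

If this reading is right, the rank list passed to `get_group` was built for more ranks than the job actually has, which points at the code that generates the groups rather than at the check itself.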

Vamix commented Nov 17, 2024

Thanks for your feedback! Aceso was not tested on clusters with an odd number of nodes, so some code may cause problems in such cluster settings. We will investigate this and update soon.
