The initialization of communication fails when the number of nodes is set to 3; the failure occurs in get_group. The relevant part of the failure is as follows (run on 3 nodes, each with 2 GPUs):
Traceback (most recent call last):
  File "pretrain_gpt.py", line 149, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step_func,
  File "/workspace/aceso/runtime/megatron/training.py", line 113, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/aceso/runtime/megatron/initialize.py", line 86, in initialize_megatron
    finish_mpu_init()
  File "/workspace/aceso/runtime/megatron/initialize.py", line 67, in finish_mpu_init
    _initialize_distributed()
  File "/workspace/aceso/runtime/megatron/initialize.py", line 209, in _initialize_distributed
    mpu.initialize_model_parallel_flexpipe()
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 288, in initialize_model_parallel_flexpipe
    get_group(ranks)
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 560, in get_group
    group_bits = bitmap(ranks)
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 555, in bitmap
    raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
ValueError: rank 6 out of range (6)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2618839) of binary: /usr/bin/python3
Traceback (most recent call last):
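From the traceback, the bitmap helper in megatron/mpu/initialize.py appears to build a per-rank bit vector and reject any rank index outside it. With 3 nodes × 2 GPUs the world size is 6, so valid ranks are 0..5, yet a generated group seems to reference rank 6. The snippet below is only a minimal sketch of that kind of range check (not the actual Aceso implementation) to illustrate how the error is triggered:

```python
# Minimal sketch (assumption: not the actual Aceso code) of a bitmap-style
# range check like the one raising the error in megatron/mpu/initialize.py.
def bitmap(ranks, world_size=6):
    """Set one bit per rank in a world_size-length bit list."""
    bits = [0] * world_size
    for rank in ranks:
        if rank >= len(bits):
            # Mirrors the failure in the traceback above:
            #   ValueError: rank 6 out of range (6)
            raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
        bits[rank] = 1
    return bits

# With 3 nodes x 2 GPUs the valid ranks are 0..5 (world size 6), so any
# process group built as if a fourth node existed (ranks 6 and 7) fails:
print(bitmap([0, 2, 4]))   # fine: [1, 0, 1, 0, 1, 0]
print(bitmap([2, 4, 6]))   # raises ValueError: rank 6 out of range (6)
```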
Thanks for your feedback! Aceso was not tested on clusters with an odd number of nodes, so some code may cause problems in such cluster settings. We will investigate this and update soon.
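Until the odd-node case is handled, a guard before group creation would at least turn the failure into a clearer message. The helper below is only a suggestion sketched against torch.distributed; neither the function nor this behaviour exists in the repository:

```python
import torch.distributed as dist

def validate_group_ranks(ranks):
    """Fail early with a descriptive message if a rank group references a
    rank outside the launched world size (e.g. rank 6 when 3 nodes x 2 GPUs
    give ranks 0..5). Hypothetical helper, not part of the Aceso codebase."""
    world_size = dist.get_world_size()
    bad = sorted(r for r in ranks if r < 0 or r >= world_size)
    if bad:
        raise ValueError(
            "group {} references ranks {} outside world size {}; check that "
            "the node count matches the parallelism configuration".format(
                ranks, bad, world_size))
```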