Setting dp_shard > 1 get incorrect number of dp_rank #58

Open
new5558 opened this issue Nov 27, 2024 · 1 comment

new5558 commented Nov 27, 2024

I am trying to run the debug tutorial on a multi-node SLURM cluster (8 nodes, 4 GPUs per node). With dp_shard = 4, tp_size = 1, and dp_replicate = 8, I get this error:

.......lingua/data.py", line 506, in distribute_data_to_rank
[rank30]:     return rank_to_jsonl_iterator_params[rank]
[rank30]:            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
[rank30]: IndexError: list index out of range


At first, I thought the cause was that rank_to_jsonl_iterator_params had fewer entries than the number of running processes. However, after printing rank, I found that it can go up to 59, which is higher than my data-parallel process count, dp_degree = 32.

After some investigation, I think it may be caused by this code block in train.py:

# build dataloader
# need dp world size and rank
dp_mesh = world_mesh["dp_replicate"]
dp_degree = dp_mesh.size() # 8
dp_rank = dp_mesh.get_local_rank() # [0-7]
if args.distributed.dp_shard > 1:
  dp_rank = dp_rank * dp_degree + world_mesh["dp_shard"].get_local_rank() # [0-7] * 8 + [0-3] = [0-59]
  dp_degree *= world_mesh["dp_shard"].size() # 8 * 4 = 32
logger.info(f"Running on dp rank : {dp_rank}") # [0-59]
logger.info(f"Running on dp size : {dp_degree}") # 32

dp_rank = dp_rank * dp_degree + world_mesh["dp_shard"].get_local_rank()

So my question is: is it normal for dp_rank to exceed the dp size (dp_degree), or am I misunderstanding something? Thank you.
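The arithmetic in the annotated block can be checked without a cluster. The following is a minimal pure-Python sketch, not the actual lingua code: the function name `dp_rank_as_written` is hypothetical, and plain integer loops stand in for the `get_local_rank()` and `size()` calls on the mesh, assuming local ranks run from 0 to size - 1 in each dimension.

```python
# Sizes taken from the configuration reported in this issue.
DP_REPLICATE = 8
DP_SHARD = 4


def dp_rank_as_written(replicate_rank: int, shard_rank: int, replicate_size: int) -> int:
    """Mirror the quoted line from train.py.

    At that point dp_degree still holds the dp_replicate size, so the
    replicate rank is multiplied by 8 rather than by the shard size.
    """
    return replicate_rank * replicate_size + shard_rank


# Enumerate every (replicate rank, shard rank) pair in the 8 x 4 layout.
ranks_as_written = [
    dp_rank_as_written(r, s, DP_REPLICATE)
    for r in range(DP_REPLICATE)
    for s in range(DP_SHARD)
]

dp_degree = DP_REPLICATE * DP_SHARD
out_of_range = [rank for rank in ranks_as_written if rank >= dp_degree]

print("max dp_rank:", max(ranks_as_written))
print("dp_degree:", dp_degree)
print("ranks that would index past the end:", len(out_of_range))
```

Under these assumptions, every process whose dp_replicate rank is 4 or higher computes a dp_rank of at least 32, which would be out of range for a list with one entry per data-parallel rank.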

Hannibal046 commented

Same question. I believe it should be changed from

dp_rank = dp_rank * dp_degree + world_mesh["dp_shard"].get_local_rank()

to

dp_rank = dp_rank * world_mesh["dp_shard"].size() + world_mesh["dp_shard"].get_local_rank() 
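As a sanity check on the suggested change, here is a small pure-Python sketch for the 8 x 4 configuration from this issue. It is only an illustration under stated assumptions: `proposed_dp_rank` is a hypothetical name, and integer loops replace the real mesh calls, assuming local ranks run from 0 to size - 1 in each dimension.

```python
# Sizes taken from the configuration reported in this issue.
DP_REPLICATE = 8
DP_SHARD = 4


def proposed_dp_rank(replicate_rank: int, shard_rank: int, shard_size: int) -> int:
    """Flatten (replicate rank, shard rank) using the shard size as the stride."""
    return replicate_rank * shard_size + shard_rank


# Compute the flattened rank for every (replicate rank, shard rank) pair.
proposed_ranks = sorted(
    proposed_dp_rank(r, s, DP_SHARD)
    for r in range(DP_REPLICATE)
    for s in range(DP_SHARD)
)

# Each data-parallel process should get a unique rank in [0, dp_degree).
expected_ranks = list(range(DP_REPLICATE * DP_SHARD))
print("covers 0..31 exactly once:", proposed_ranks == expected_ranks)
```

With the shard size as the multiplier, the 32 combinations map onto 0 through 31 with no gaps or duplicates, which matches a dp_degree of 32.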
