[BUG] pdsh runner doesn't work with tqdm bar #6978

Open
Superskyyy opened this issue Jan 29, 2025 · 3 comments
Labels: bug (Something isn't working), training

Comments

@Superskyyy

Describe the bug
The training tqdm progress bar does not show in the rank 0 console when pdsh is used as the launcher.

To Reproduce
Run any multi-node training job with an example script. The tqdm bar from the local rank (localhost) does not show up; a blank line appears instead.

I assume the root cause is that pdsh tries to ssh even into localhost, and the bar somehow cannot be rendered through an ssh session.
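
One way to reason about this hypothesis without a cluster is to model what a line-oriented forwarder would do with progress-bar output. tqdm redraws its bar by writing a carriage return (`\r`) followed by the new text, with no trailing newline. The sketch below is an illustration only: `forward_complete_lines` is a hypothetical helper, and the assumption that the remote output is relayed one newline-terminated line at a time has not been confirmed in this thread.

```python
def forward_complete_lines(chunks):
    """Hypothetical model of a relay that only emits newline-terminated lines.

    Returns (forwarded_lines, pending_text). This is an assumption about how
    the output might be relayed, not a description of pdsh internals.
    """
    buffer = ""
    forwarded = []
    for chunk in chunks:
        buffer += chunk
        # Emit every complete line currently held in the buffer.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            forwarded.append(line)
    return forwarded, buffer


# Progress-bar style writes: each update starts with "\r" and has no newline.
bar_updates = ["\r 10%|#         |", "\r 50%|#####     |", "\r100%|##########|"]

forwarded, pending = forward_complete_lines(bar_updates)
print(forwarded)      # no complete line has been seen yet
print(repr(pending))  # all updates are still waiting for a newline
```

Under that assumption, nothing reaches the console until a newline is written, which would be consistent with the bar being invisible while training runs.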

Expected behavior
Be able to see the tqdm bar.

System info (please complete the following information):

  • OS: Ubuntu 22.04
  • 8 GPUs per node
  • Python version 3.12.8
  • 4 nodes, same network

Launcher context
PDSH

@Superskyyy added the bug and training labels on Jan 29, 2025

@loadams
Contributor

loadams commented Jan 31, 2025

Hi @Superskyyy - do you have a sample repro that I can run to use as a starting point for debug?

@Superskyyy
Author

> Hi @Superskyyy - do you have a sample repro that I can run to use as a starting point for debug?

I guess you can directly use accelerate + deepspeed to run the PPO example in the official TRL repo (https://github.com/huggingface/trl/tree/main/examples/scripts/ppo), but I suppose this problem exists in all deepspeed + pdsh based training code.
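
A smaller starting point than the full TRL example might be a standalone script that writes progress the way a tqdm-style bar does: carriage-return updates on stderr with a single newline at the end. The sketch below uses only the standard library so it can be launched on each node without extra dependencies. `render_bar` and `run` are hypothetical names, and the script is a stand-in for tqdm, not a confirmed reproduction of this issue.

```python
import sys
import time


def render_bar(done, total, width=20):
    """Return a fixed-width text bar such as '[#####     ] 5/10'."""
    filled = width * done // total
    return "[" + "#" * filled + " " * (width - filled) + f"] {done}/{total}"


def run(total=20, delay=0.1, stream=sys.stderr):
    """Redraw the bar in place with '\\r', as tqdm-style bars do."""
    for step in range(1, total + 1):
        stream.write("\r" + render_bar(step, total))
        stream.flush()
        time.sleep(delay)
    # Only the final write ends the line.
    stream.write("\n")
    stream.flush()


if __name__ == "__main__":
    run()
```

Running it once directly in a terminal and once through the pdsh launcher should show whether in-place updates survive the launcher, independent of DeepSpeed or TRL.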
