PaddleOCR Multi-Machine Distributed Training #14443

Closed
ntdgo opened this issue Dec 20, 2024 · 2 comments

ntdgo commented Dec 20, 2024

Please ask your question

Hello,

I am currently working on training an OCR classification model in parallel across two machines and would appreciate some guidance on my setup. Below are the details of my configuration:

I have two computers:
Machine 1:

  • Windows 11
  • GPU: RTX4090
  • Public IP: 212.109.144.125
  • Port open: 6004

Machine 2:

  • Windows 11
  • GPU: RTX3090
  • Public IP: 122.109.144.229

I installed PaddlePaddle using the following command on both machines:
python -m pip install paddlepaddle-gpu==3.0.0b2 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/

I used the following training commands, based on the paddle.distributed.launch documentation (https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/distributed/launch_en.html):

On Machine 1:

python -m paddle.distributed.launch --gpus 0 --master=192.168.1.123:6004 ./PaddleOCR/tools/train.py -c ./configs/cls.yml

On Machine 2:

python -m paddle.distributed.launch --gpus 0 --master=212.109.144.125:6004 ./PaddleOCR/tools/train.py -c ./configs/cls.yml

However, when I start training, the two machines do not appear to connect to each other or train together. Is there a problem with my setup or with the training commands?
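
A quick way to narrow down a connection problem like this is to check, from Machine 2, whether the master endpoint accepts TCP connections at all. The sketch below uses only the Python standard library. The helper name `can_reach` is hypothetical, and the endpoint in the comment is the one from the Machine 2 command. A successful check only shows that something is listening on that port and that no firewall or NAT rule blocks it; it does not prove the launch configuration is correct.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Example (run on Machine 2 while the launcher is running on Machine 1):
#   can_reach("212.109.144.125", 6004)
```

If this returns False while the launcher is running on Machine 1, the problem is likely at the network level (port forwarding, firewall) rather than in the training command.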

Could anyone help me identify what might be wrong or suggest how to fix this?

Thank you in advance for your assistance!

jzhang533 transferred this issue from PaddlePaddle/Paddle on Dec 24, 2024

GreatV (Collaborator) commented Dec 24, 2024

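Maybe you need to set the same master IP on both machines. Your Machine 1 command uses --master=192.168.1.123:6004, while Machine 2 uses --master=212.109.144.125:6004.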
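
One way to keep the master endpoint from diverging is to generate the launch command for both machines from a single value. The sketch below is illustrative only. The helper `launch_command` is hypothetical, `--nnodes=2` is an assumption (the original commands do not pass it), and the choice of the public address as master is also an assumption. Which address is correct depends on whether the two machines share a LAN or reach each other over the internet.

```python
import shlex


def launch_command(master: str, nnodes: int, script: str, config: str,
                   gpus: str = "0") -> str:
    """Build a paddle.distributed.launch command line (hypothetical helper)."""
    args = [
        "python", "-m", "paddle.distributed.launch",
        f"--master={master}",   # must be identical on every machine
        f"--nnodes={nnodes}",   # assumption: total number of machines in the job
        "--gpus", gpus,
        script, "-c", config,
    ]
    return shlex.join(args)


# Assumption: both machines reach the master through Machine 1's public address.
MASTER = "212.109.144.125:6004"

# The same string is run on Machine 1 and on Machine 2.
cmd = launch_command(MASTER, 2, "./PaddleOCR/tools/train.py", "./configs/cls.yml")
print(cmd)
```

Running the identical command on both machines removes the mismatch between 192.168.1.123 and 212.109.144.125 seen in the original commands.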

ntdgo (Author) commented Dec 25, 2024

Hi @GreatV, the comment at PaddlePaddle/Paddle#68480 (comment) helped me solve the problem. Thank you for your support!
