PaliGemma fine-tuning - error with distributed training #20496

Open
vdrvar opened this issue Dec 13, 2024 · 0 comments
Labels
bug (Something isn't working) · needs triage (Waiting to be triaged by maintainers) · ver: 2.4.x

Comments

vdrvar commented Dec 13, 2024

Bug description

I'm having an issue while adapting the fine-tuning logic from this HF tutorial:

https://github.com/NielsRogge/Transformers-Tutorials/blob/master/PaliGemma/Fine_tune_PaliGemma_for_image_%3EJSON.ipynb

I can't run distributed training on multiple GPUs: when I run the training script with a config that includes GPUs 0 and 1, I get a Segmentation fault (core dumped) error. I am also using QLoRA.
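
For context, my quantization and adapter setup roughly follows the tutorial; a minimal sketch (the model id and LoRA target modules here are assumptions based on the tutorial, not copied verbatim from my script):

import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224",  # model id assumed from the tutorial
    quantization_config=bnb_config,
)

# LoRA adapters on the attention projections (target modules assumed)
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)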

Please advise.

What version are you seeing the problem on?

master

How to reproduce the bug

import lightning as L

# Create trainer
trainer = L.Trainer(
    accelerator="gpu",
    devices=[0, 1],  # Use devices from config
    strategy="ddp",
    ...
)
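
My suspicion is the interaction between the 4-bit quantized model and DDP: strategy="ddp" runs one process per GPU, and if both ranks load the quantized weights onto the same device, the process can crash. A hedged sketch of pinning the load to each rank's own GPU (the LOCAL_RANK handling is my assumption, not something from the tutorial):

import os
import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

# Each DDP rank is a separate process, so the quantized weights should be
# placed on that rank's own GPU rather than a single shared device.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224",  # assumed model id
    quantization_config=bnb_config,
    device_map={"": local_rank},
)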

Error messages and logs

`low_cpu_mem_usage` was None, now default to True since model is quantized.
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 9709.04it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.17s/it]
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

Segmentation fault (core dumped)

Environment

pyproject.toml:

transformers = "^4.44.2"
torch = "^2.4.1"
lightning = "^2.4.0"
peft = "^0.13.2"
accelerate = "^1.1.1"
bitsandbytes = "^0.45.0"

More info

No response

@vdrvar added the bug and needs triage labels on Dec 13, 2024