PaliGemma fine-tuning - error with distributed training #20496

Open
vdrvar opened this issue Dec 13, 2024 · 0 comments
Labels
bug (Something isn't working) · needs triage (Waiting to be triaged by maintainers) · ver: 2.4.x

Comments

vdrvar commented Dec 13, 2024

Bug description

I'm having an issue while adapting the fine-tuning logic from this HF tutorial:

https://github.com/NielsRogge/Transformers-Tutorials/blob/master/PaliGemma/Fine_tune_PaliGemma_for_image_%3EJSON.ipynb

I can't run distributed training on multiple GPUs: when I run the training script with a config that includes GPUs 0 and 1, I get a Segmentation fault (core dumped) error. I am also using QLoRA.
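
For context, my quantization and adapter setup roughly follows the tutorial; a minimal sketch (the model id and LoRA target modules here are assumptions based on the tutorial, not copied verbatim from my script):

import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224",  # model id assumed from the tutorial
    quantization_config=bnb_config,
)

# LoRA adapters on the attention projections (target modules assumed)
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)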

Please advise.

What version are you seeing the problem on?

master

How to reproduce the bug

import lightning as L

# Create trainer
trainer = L.Trainer(
    accelerator="gpu",
    devices=[0, 1],  # Use devices from config
    strategy="ddp",
    ...
)
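
My suspicion is the interaction between the 4-bit quantized model and DDP: strategy="ddp" runs one process per GPU, and if both ranks load the quantized weights onto the same device, the process can crash. A hedged sketch of pinning the load to each rank's own GPU (the LOCAL_RANK handling is my assumption, not something from the tutorial):

import os
import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

# Each DDP rank is a separate process, so the quantized weights should be
# placed on that rank's own GPU rather than a single shared device.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224",  # assumed model id
    quantization_config=bnb_config,
    device_map={"": local_rank},
)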

Error messages and logs

`low_cpu_mem_usage` was None, now default to True since model is quantized.
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 9709.04it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.17s/it]
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

Segmentation fault (core dumped)

Environment

pyproject.toml:

transformers = "^4.44.2"
torch = "^2.4.1"
lightning = "^2.4.0"
peft = "^0.13.2"
accelerate = "^1.1.1"
bitsandbytes = "^0.45.0"

More info

No response

@vdrvar added the bug and needs triage labels on Dec 13, 2024