🐛 Describe the bug

When I try to use multi-GPU training with accelerate, I get an error.

Code:

Launch command:

Error:
File "/home/olivia/experiments/cot_reliability/trlx_minimal.py", line 73, in <module>
trainer = trlx.train(
File "/home/olivia/miniconda3/envs/exps/lib/python3.9/site-packages/trlx/trlx.py", line 92, in train
trainer = get_trainer(config.train.trainer)(
File "/home/olivia/miniconda3/envs/exps/lib/python3.9/site-packages/trlx/trainer/accelerate_ppo_trainer.py", line 74, in __init__
if not hasattr(self.model, "frozen_head") and not self.model.peft_type:
File "/home/olivia/miniconda3/envs/exps/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'peft_type'
The error comes from these lines in `accelerate_ppo_trainer.py`:
```python
self.model, self.opt, self.scheduler, rollout_loader = self.accelerator.prepare(
    self.model, self.opt, self.scheduler, rollout_loader
)

self.store.clear_history()  # Clear the rollout store

if not hasattr(self.model, "frozen_head") and not self.model.peft_type:
    self.ref_model = self.get_arch(self.config)
```
`self.model` originally has a `peft_type` attribute set to `None`, but in multi-GPU mode it seems like the `self.accelerator.prepare` call wraps the model in a `DistributedDataParallel`, which doesn't have this attribute.
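To make the mechanism concrete, here is a small toy reproduction (plain PyTorch, not trlx code): `DistributedDataParallel` keeps the original model under `.module`, and a custom Python attribute like `peft_type` set on the inner model is not visible on the wrapper. A plain wrapper module shows the same effect without needing a process group:

```python
import torch.nn as nn

# Toy illustration only (not trlx code): a custom attribute on a module is not
# visible on a wrapper module, which is essentially what DistributedDataParallel is.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.peft_type = None  # custom attribute, analogous to trlx's model

class Wrapper(nn.Module):
    """Stand-in for DistributedDataParallel: stores the real model as .module."""
    def __init__(self, module):
        super().__init__()
        self.module = module

model = TinyModel()
wrapped = Wrapper(model)

print(hasattr(model, "peft_type"))    # True
print(hasattr(wrapped, "peft_type"))  # False -> wrapped.peft_type raises AttributeError
print(wrapped.module.peft_type)       # None: the attribute is still on the inner model
```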
We can get around this by saving the `peft_type` attribute before `accelerate.prepare` and setting it again afterwards. This makes the code run correctly.
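Something along these lines is what I mean (just a sketch of the edit to the `__init__` shown above, not the exact diff I ran):

```python
# Sketch of the workaround: remember the attribute while the unwrapped model is
# still accessible, then re-attach it after accelerate has wrapped the model.
peft_type = self.model.peft_type  # None when PEFT is not used

self.model, self.opt, self.scheduler, rollout_loader = self.accelerator.prepare(
    self.model, self.opt, self.scheduler, rollout_loader
)

# In multi-GPU runs self.model is now a DistributedDataParallel wrapper, which
# does not expose the wrapped model's custom attributes, so we set it back here.
self.model.peft_type = peft_type
```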
However, even with this change, multi-GPU training does not work when using `peft` to implement LoRA.

If I uncomment the `peft_config` lines in the example script above and change `num_layers_unfrozen` to 1, this seems to work correctly with single-GPU training. However, when I add a second GPU, the script fails with an error saying that `DistributedDataParallel` has no attribute `forward_hydra`.

This problem can be fixed by removing all references to `peft_type` in `accelerate_ppo_trainer.py`. (This also makes the fix above unnecessary.) When I do this, training seems to run correctly with LoRA on both GPUs. However, I am not familiar enough with this codebase to know whether this fix introduces additional errors that are not obvious.
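For what it's worth, an alternative I have not tested would be to look the attributes up on the unwrapped model rather than on the `DistributedDataParallel` wrapper, e.g. with `accelerator.unwrap_model` (a sketch only, not verified against the rest of the codebase):

```python
# Untested sketch of an alternative: query the underlying model instead of the
# DistributedDataParallel wrapper. Accelerator.unwrap_model returns the original
# module that was passed to accelerator.prepare.
unwrapped_model = self.accelerator.unwrap_model(self.model)
if not hasattr(unwrapped_model, "frozen_head") and not getattr(unwrapped_model, "peft_type", None):
    self.ref_model = self.get_arch(self.config)
```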
Which trlX version are you using?
trlx==0.7.0
Additional system and package information
python 3.9, transformers 4.35.0, accelerate 0.24.1, Ubuntu