Multi-GPU training errors with peft #581

Open
AliengirlLiv opened this issue Nov 20, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@AliengirlLiv

🐛 Describe the bug

When I try to use multi-GPU training with accelerate, I get an error.

Code:

import trlx
from peft import LoraConfig, TaskType
from trlx.data.configs import (
    ModelConfig,
    OptimizerConfig,
    SchedulerConfig,
    TokenizerConfig,
    TrainConfig,
    TRLConfig,
)
from trlx.models.modeling_ppo import PPOConfig

config = TRLConfig(
    train=TrainConfig(
        seq_length=1024,
        epochs=50,
        total_steps=100000,
        batch_size=1,
        checkpoint_interval=1000,
        eval_interval=200,
        pipeline="PromptPipeline",
        trainer="AcceleratePPOTrainer",
    ),
    model=ModelConfig(
        model_path='gpt2',
        num_layers_unfrozen=1,
        # peft_config={"peft_type": "LORA", "r": 1, "lora_alpha": 32, "lora_dropout": 0.1},
    ),
    tokenizer=TokenizerConfig(tokenizer_path='gpt2', truncation_side="right"),
    optimizer=OptimizerConfig(name="adamw"),
    scheduler=SchedulerConfig(name="cosine_annealing", kwargs={"T_max": 100000, "eta_min": 5.0e-6}),
    method=PPOConfig(
        name="PPOConfig",
        num_rollouts=128,
        chunk_size=16,
        ppo_epochs=4,
        init_kl_coef=0.1,
        target=6,
        horizon=10000,
        gamma=1,
        lam=0.95,
        cliprange=0.2,
        cliprange_value=0.2,
        vf_coef=0.2,
        scale_reward=None,
        ref_mean=None,
        ref_std=None,
        cliprange_reward=10,
        gen_kwargs={
            "max_new_tokens": 50,
        },
    ),
)

if __name__ == "__main__":

    def reward_fn(samples, **kwargs):
        return [0] * len(samples)

    trainer = trlx.train(
        reward_fn=reward_fn,
        prompts=['dummy dataset'],
        config=config,
    )

Launch command:

CUDA_VISIBLE_DEVICES=0,1 debug=true accelerate launch --mixed_precision bf16 trlx_minimal.py

Error:

File "/home/olivia/experiments/cot_reliability/trlx_minimal.py", line 73, in <module>
    trainer = trlx.train(
  File "/home/olivia/miniconda3/envs/exps/lib/python3.9/site-packages/trlx/trlx.py", line 92, in train
    trainer = get_trainer(config.train.trainer)(
  File "/home/olivia/miniconda3/envs/exps/lib/python3.9/site-packages/trlx/trainer/accelerate_ppo_trainer.py", line 74, in __init__
    if not hasattr(self.model, "frozen_head") and not self.model.peft_type:
  File "/home/olivia/miniconda3/envs/exps/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'peft_type'

The error comes from these lines in accelerate_ppo_trainer.py:

self.model, self.opt, self.scheduler, rollout_loader = self.accelerator.prepare(
    self.model, self.opt, self.scheduler, rollout_loader
)
self.store.clear_history()  # Clear the rollout store
if not hasattr(self.model, "frozen_head") and not self.model.peft_type:
    self.ref_model = self.get_arch(self.config)

self.model originally has a peft_type attribute (set to None), but in multi-GPU mode the self.accelerator.prepare call wraps the model in a DistributedDataParallel, which does not expose that attribute.

We can work around this by saving the peft_type attribute before the accelerator.prepare call and re-setting it on the wrapped model afterwards. With that change the code runs correctly.
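
A minimal sketch of that workaround, assuming the surrounding code from accelerate_ppo_trainer.py quoted above (saved_peft_type is just an illustrative name, not part of trlx):

# Remember peft_type before accelerate wraps the model in DistributedDataParallel,
# then re-attach it to the wrapper so the later check still works.
saved_peft_type = getattr(self.model, "peft_type", None)

self.model, self.opt, self.scheduler, rollout_loader = self.accelerator.prepare(
    self.model, self.opt, self.scheduler, rollout_loader
)
self.model.peft_type = saved_peft_type  # restore the attribute hidden behind the DDP wrapper

self.store.clear_history()  # Clear the rollout store
if not hasattr(self.model, "frozen_head") and not self.model.peft_type:
    self.ref_model = self.get_arch(self.config)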

However, even with this change, multi-GPU training still does not work when using peft to implement LoRA.

If I uncomment the peft_config line in the example script above and set num_layers_unfrozen to 1, this seems to work correctly with single-GPU training. However, when I add a second GPU, the script fails with an error saying that DistributedDataParallel has no attribute forward_hydra.

This problem can be fixed by removing all references to peft_type in accelerate_ppo_trainer.py (which also makes the fix above unnecessary). When I do this, training seems to run correctly with LoRA on both GPUs. However, I am not familiar enough with this codebase to know whether this fix introduces other, less obvious errors.
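
For reference, the guard quoted above reduces to roughly the following once the peft_type references are stripped out (a sketch of my local change only; as noted, I have not verified that this is the right behaviour for peft models):

# The guard in accelerate_ppo_trainer.py with the peft_type check removed.
# Note that hasattr() is evaluated against the DDP wrapper, which does not forward
# attribute access to the wrapped module, so a frozen_head on the underlying model
# is not visible here either; whether this is correct for peft/LoRA runs is the open question.
if not hasattr(self.model, "frozen_head"):
    self.ref_model = self.get_arch(self.config)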

Which trlX version are you using?

trlx==0.7.0

Additional system and package information

python 3.9, transformers 4.35.0, accelerate 0.24.1, Ubuntu

AliengirlLiv added the bug label on Nov 20, 2023
@Jing-L97

Hi, I ran into the same issue with peft_type. Did you solve it in the end?
