🐛 Describe the bug

When I try to use multi-GPU training with accelerate, I get an error.

Code:

Launch command:

Error:
File "/home/olivia/experiments/cot_reliability/trlx_minimal.py", line 73, in <module>
trainer = trlx.train(
File "/home/olivia/miniconda3/envs/exps/lib/python3.9/site-packages/trlx/trlx.py", line 92, in train
trainer = get_trainer(config.train.trainer)(
File "/home/olivia/miniconda3/envs/exps/lib/python3.9/site-packages/trlx/trainer/accelerate_ppo_trainer.py", line 74, in __init__
if not hasattr(self.model, "frozen_head") and not self.model.peft_type:
File "/home/olivia/miniconda3/envs/exps/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'peft_type'
The error comes from these lines in `accelerate_ppo_trainer.py`:
```python
self.model, self.opt, self.scheduler, rollout_loader = self.accelerator.prepare(
    self.model, self.opt, self.scheduler, rollout_loader
)

self.store.clear_history()  # Clear the rollout store

if not hasattr(self.model, "frozen_head") and not self.model.peft_type:
    self.ref_model = self.get_arch(self.config)
```
`self.model` originally has a `peft_type` attribute set to `None`, but in multi-GPU mode it seems like the `self.accelerator.prepare` call wraps the model in a `DistributedDataParallel`, which doesn't have this attribute.
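To make the mechanism concrete, here is a small toy reproduction (plain PyTorch, not trlx code): `DistributedDataParallel` keeps the original model under `.module`, and a custom Python attribute like `peft_type` set on the inner model is not visible on the wrapper. A plain wrapper module shows the same effect without needing a process group:

```python
import torch.nn as nn

# Toy illustration only (not trlx code): a custom attribute on a module is not
# visible on a wrapper module, which is essentially what DistributedDataParallel is.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.peft_type = None  # custom attribute, analogous to trlx's model

class Wrapper(nn.Module):
    """Stand-in for DistributedDataParallel: stores the real model as .module."""
    def __init__(self, module):
        super().__init__()
        self.module = module

model = TinyModel()
wrapped = Wrapper(model)

print(hasattr(model, "peft_type"))    # True
print(hasattr(wrapped, "peft_type"))  # False -> wrapped.peft_type raises AttributeError
print(wrapped.module.peft_type)       # None: the attribute is still on the inner model
```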
We can get around this by saving the `peft_type` attribute before `accelerate.prepare` and setting it again afterwards. This makes the code run correctly.
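Something along these lines is what I mean (just a sketch of the edit to the `__init__` shown above, not the exact diff I ran):

```python
# Sketch of the workaround: remember the attribute while the unwrapped model is
# still accessible, then re-attach it after accelerate has wrapped the model.
peft_type = self.model.peft_type  # None when PEFT is not used

self.model, self.opt, self.scheduler, rollout_loader = self.accelerator.prepare(
    self.model, self.opt, self.scheduler, rollout_loader
)

# In multi-GPU runs self.model is now a DistributedDataParallel wrapper, which
# does not expose the wrapped model's custom attributes, so we set it back here.
self.model.peft_type = peft_type
```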
However, even with this change, multi-GPU training does not work when using `peft` to implement LoRA.

If I uncomment the `peft_config` lines in the example script above and change `num_layers_unfrozen` to 1, this seems to work correctly with single-GPU training. However, when I add a second GPU, the script fails with an error saying that `DistributedDataParallel` has no attribute `forward_hydra`.

This problem can be fixed by removing all references to `peft_type` in `accelerate_ppo_trainer.py`. (This also makes the fix above unnecessary.) When I do this, training seems to run correctly with LoRA on both GPUs. However, I am not familiar enough with this codebase to know whether this fix introduces additional errors that are not obvious.
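For what it's worth, an alternative I have not tested would be to look the attributes up on the unwrapped model rather than on the `DistributedDataParallel` wrapper, e.g. with `accelerator.unwrap_model` (a sketch only, not verified against the rest of the codebase):

```python
# Untested sketch of an alternative: query the underlying model instead of the
# DistributedDataParallel wrapper. Accelerator.unwrap_model returns the original
# module that was passed to accelerator.prepare.
unwrapped_model = self.accelerator.unwrap_model(self.model)
if not hasattr(unwrapped_model, "frozen_head") and not getattr(unwrapped_model, "peft_type", None):
    self.ref_model = self.get_arch(self.config)
```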
Which trlX version are you using?
trlx==0.7.0
Additional system and package information
python 3.9, transformers 4.35.0, accelerate 0.24.1, Ubuntu