Peft LoRA checkpoint will not be saved if DDP enabled and PEFT is enabled #114

Closed

billweasley opened this issue on Jul 14, 2024 · 1 comment

billweasley commented Jul 14, 2024

System Info

Same as #113

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

Train with the following parameter combination:
enable_ddp=True
enable_fsdp=False
use_peft=True
freeze_llm=False

The LoRA model is not saved anywhere. Is this expected?
I checked save_model_checkpoint_peft in checkpoint_handler.py: it does not seem to call save_pretrained to save the PEFT model. The version used for FSDP seems more reasonable, since it saves the PEFT model when use_peft=True and freeze_llm=True.
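
For comparison, here is a minimal sketch of the kind of PEFT-aware save I would expect, roughly what the FSDP path does when use_peft=True (assuming model.llm is a peft.PeftModel; the helper name and directory below are only illustrative, not the project's actual code):

import os

def save_peft_adapter(model, cfg, checkpoint_name="checkpoint"):
    peft_dir = os.path.join(cfg.output_dir, checkpoint_name)
    os.makedirs(peft_dir, exist_ok=True)
    # save_pretrained on a PeftModel writes only the adapter weights
    # (adapter_model.safetensors / .bin) plus adapter_config.json
    model.llm.save_pretrained(peft_dir)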

I had to change the code as follows to save it to a subpath under <SAVE_PATH>/llm:

def save_model_checkpoint_peft(model, optimizer, rank, cfg, checkpoint_name="checkpoint", save_trainable_only=True):
    logger.info(f"--> saving model ...")
    save_dir = os.path.join(cfg.output_dir, checkpoint_name)
    os.makedirs(save_dir, exist_ok=True)
    save_full_path = os.path.join(save_dir, "model.pt")

    if cfg.enable_ddp:
        model = model.module
    cpu_state = model.state_dict()
    # ===== added by me: save the (PEFT-wrapped) LLM via save_pretrained
    llm_output = os.path.join(save_dir, "llm/")
    os.makedirs(llm_output, exist_ok=True)
    if not cfg.freeze_llm:
        llm_dict = {}
        for key in cpu_state.keys():
            if key.startswith("llm."):
                llm_dict[key] = cpu_state[key]
        model.llm.save_pretrained(save_directory=llm_output, state_dict=llm_dict)
        logger.info(f"llm saved at {llm_output}")
    # ===== end

    # original code, replaced by the block below:
    # if save_trainable_only:
    #     state_dict = OrderedDict()
    #     for name, para in model.named_parameters():
    #         if para.requires_grad:
    #             state_dict[name] = cpu_state[name]
    # else:
    #     state_dict = cpu_state
    # torch.save(state_dict, save_full_path)

    # ===== added by me: save encoder (if trainable) and encoder_projector weights
    encoder_dict = {}
    if not cfg.freeze_encoder:
        for key in cpu_state.keys():
            if key.startswith("encoder."):
                encoder_dict[key] = cpu_state[key]
    for key in cpu_state.keys():
        if key.startswith("encoder_projector."):
            encoder_dict[key] = cpu_state[key]
    torch.save(encoder_dict, save_full_path)
    # ===== end
    logger.info(f"encoder saved at {save_full_path}")

Error logs

No error logs, but no PEFT model is saved anywhere.

Expected behavior

I think the PEFT (LoRA) model should be saved in this configuration. Did I miss anything?

billweasley (Author) commented

Just saw this: #103
