Peft LoRA checkpoint will not be saved if DDP enabled and PEFT is enabled #114

Closed

billweasley opened this issue on Jul 14, 2024 · 1 comment

billweasley commented Jul 14, 2024

System Info

Same as #113

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

Train with the following parameter combination:
enable_ddp=True
enable_fsdp=False
use_peft=True
freeze_llm=False

The LoRA model is not saved anywhere. Is this expected?
I checked save_model_checkpoint_peft in checkpoint_handler.py: it does not seem to call save_pretrained to save the PEFT model. The version used for FSDP seems more reasonable, since it saves the PEFT model when use_peft=True and freeze_llm=True.
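
For comparison, here is a minimal sketch of the kind of PEFT-aware save I would expect, roughly what the FSDP path does when use_peft=True (assuming model.llm is a peft.PeftModel; the helper name and directory below are only illustrative, not the project's actual code):

import os

def save_peft_adapter(model, cfg, checkpoint_name="checkpoint"):
    peft_dir = os.path.join(cfg.output_dir, checkpoint_name)
    os.makedirs(peft_dir, exist_ok=True)
    # save_pretrained on a PeftModel writes only the adapter weights
    # (adapter_model.safetensors / .bin) plus adapter_config.json
    model.llm.save_pretrained(peft_dir)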

I had to change the code as follows to save it to a subpath under <SAVE_PATH>/llm:

def save_model_checkpoint_peft(model, optimizer, rank, cfg, checkpoint_name="checkpoint", save_trainable_only=True):
    logger.info(f"--> saving model ...")
    save_dir = os.path.join(cfg.output_dir, checkpoint_name)
    os.makedirs(save_dir, exist_ok=True)
    save_full_path = os.path.join(save_dir, "model.pt")

    if cfg.enable_ddp:
        model = model.module
    cpu_state = model.state_dict()
    # ===== added by me: save the (PEFT-wrapped) LLM via save_pretrained
    llm_output = os.path.join(save_dir, "llm/")
    os.makedirs(llm_output, exist_ok=True)
    if not cfg.freeze_llm:
        llm_dict = {}
        for key in cpu_state.keys():
            if key.startswith("llm."):
                llm_dict[key] = cpu_state[key]
        model.llm.save_pretrained(save_directory=llm_output, state_dict=llm_dict)
        logger.info(f"llm saved at {llm_output}")
    # ===== end

    # original code, replaced by the block below:
    # if save_trainable_only:
    #     state_dict = OrderedDict()
    #     for name, para in model.named_parameters():
    #         if para.requires_grad:
    #             state_dict[name] = cpu_state[name]
    # else:
    #     state_dict = cpu_state
    # torch.save(state_dict, save_full_path)

    # ===== added by me: save encoder (if trainable) and encoder_projector weights
    encoder_dict = {}
    if not cfg.freeze_encoder:
        for key in cpu_state.keys():
            if key.startswith("encoder."):
                encoder_dict[key] = cpu_state[key]
    for key in cpu_state.keys():
        if key.startswith("encoder_projector."):
            encoder_dict[key] = cpu_state[key]
    torch.save(encoder_dict, save_full_path)
    # ===== end
    logger.info(f"encoder saved at {save_full_path}")

Error logs

No error logs, but no PEFT model is saved anywhere.

Expected behavior

I think the PEFT (LoRA) model should be saved in this configuration. Did I miss anything?

billweasley (Author) commented

Just saw this: #103
