Can't resume DiT training #1269

Open

carzacc opened this issue Aug 26, 2023 · 2 comments

I am using DiT and trying to fine-tune it for layout analysis (object detection) on a dataset other than PubLayNet (the end goal is to fine-tune it to go beyond its current classification capabilities).

The problem arises when using:

  • the official example scripts
  • my own modified scripts

If I try to resume training, whether I use the config in object_detection or the generated one in the output directory, I get warnings like the following:

WARNING [08/26 17:26:55 fvcore.common.checkpoint]: Some model parameters or buffers are not found in the checkpoint:
backbone.fpn_lateral2.{bias, weight}
backbone.fpn_lateral3.{bias, weight}
backbone.fpn_lateral4.{bias, weight}
backbone.fpn_lateral5.{bias, weight}
backbone.fpn_output2.{bias, weight}
backbone.fpn_output3.{bias, weight}
backbone.fpn_output4.{bias, weight}
backbone.fpn_output5.{bias, weight}
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_head.fc1.{bias, weight}
roi_heads.box_head.fc2.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
roi_heads.mask_head.deconv.{bias, weight}
roi_heads.mask_head.mask_fcn1.{bias, weight}
roi_heads.mask_head.mask_fcn2.{bias, weight}
roi_heads.mask_head.mask_fcn3.{bias, weight}
roi_heads.mask_head.mask_fcn4.{bias, weight}
roi_heads.mask_head.predictor.{bias, weight}
WARNING [08/26 17:26:55 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
  backbone.bottom_up.backbone.backbone.fpn_lateral2.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_output2.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_lateral3.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_output3.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_lateral4.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_output4.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_lateral5.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_output5.{bias, weight}
  backbone.bottom_up.backbone.proposal_generator.rpn_head.conv.{bias, weight}
  backbone.bottom_up.backbone.proposal_generator.rpn_head.objectness_logits.{bias, weight}
  backbone.bottom_up.backbone.proposal_generator.rpn_head.anchor_deltas.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.box_head.fc1.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.box_head.fc2.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.box_predictor.cls_score.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.box_predictor.bbox_pred.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.mask_fcn1.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.mask_fcn2.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.mask_fcn3.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.mask_fcn4.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.deconv.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.predictor.{bias, weight}

and, when training starts, the loss starts very high (seemingly random, between 2 and 5) instead of close to 0.5, which is where I had left it.

I need to be able to resume training because the policies of the university cluster I am using do not allow long training sessions.

Environment:

  • Platform: Ubuntu 20.04.6, CUDA 11.4
  • Python version: 3.9.17
  • PyTorch version: 1.9.1+cu111
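
To narrow down where the mismatch is introduced, it may help to check whether the prefixed keys are already present in the saved checkpoint file or only appear after the custom checkpointer rewrites the keys at load time. A minimal diagnostic sketch (the checkpoint path is an assumption; a Detectron2-style checkpoint stores the weights under a "model" entry):

```python
# Minimal diagnostic sketch -- the checkpoint path below is an assumption.
import torch

ckpt = torch.load("output/model_final.pth", map_location="cpu")
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt

prefix = "backbone.bottom_up.backbone."
prefixed = [k for k in state_dict if k.startswith(prefix)]
plain_heads = [k for k in state_dict if k.startswith(("roi_heads.", "proposal_generator."))]

print(f"total keys: {len(state_dict)}")
print(f"keys starting with '{prefix}': {len(prefixed)}")
print(f"plain detector-head keys: {len(plain_heads)}")
```

If the prefixed keys are not in the file itself, the renaming must be happening inside the checkpointer at load time.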

carzacc commented Aug 26, 2023

By the way, I have opened PR #1242, which corrects one of the training examples in the README.


carzacc commented Sep 1, 2023

https://github.com/microsoft/unilm/blob/b60c741f746877293bb85eed6806736fc8fa0ffd/dit/object_detection/ditod/mycheckpointer.py#L199C1-L205C10

By not appending that prefix, I managed to fix the issue and resume training correctly. Why is that prefix there?
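
My guess is that the prefix is there so that a bare DiT/BEiT pre-training checkpoint can be mapped onto the detector's backbone.bottom_up.backbone.* keys, and that the same renaming then gets applied to full detector checkpoints written to the output directory, even though their keys already match the model. Just to illustrate the idea (this is a sketch, not the actual code at the linked lines, and maybe_add_prefix is a made-up name), a conditional prefix would handle both cases:

```python
# Illustrative sketch only -- not the actual code at the linked lines;
# maybe_add_prefix is a made-up helper name.
# Idea: only prepend the backbone prefix when the checkpoint looks like a bare
# DiT/BEiT backbone; full detector checkpoints already have matching keys.
def maybe_add_prefix(state_dict, prefix="backbone.bottom_up.backbone."):
    resuming_full_detector = any(
        k.startswith(("roi_heads.", "proposal_generator.", "backbone.fpn_"))
        for k in state_dict
    )
    if resuming_full_detector:
        return state_dict  # keys already match the detector model
    return {prefix + k: v for k, v in state_dict.items()}
```

With something like this, warm-starting from the released DiT weights and resuming from the output directory could both go through the same code path.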
