It takes a lot of time to resume training after running evaluation [FSDP]. What can be the reason behind this? #20257
Unanswered
psr-ai asked this question in DDP / multi-GPU / multi-node
My training process runs as follows:
----- 1. Sanity Check Validation ------
----- 2. Training ---------------------
----- 3. Validation -------------------
----- 4. Training ---------------------
The transition from Sanity Check Validation to Training (Step 1 to 2) is almost instant, but the transition from Validation back to Training (Step 3 to 4) takes a long time. I am training a large language model with the FSDP strategy. What could be the reason?
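To put a number on the delay, here is a minimal sketch of a timing callback. It assumes the training loop is driven by a Lightning `Trainer`. `ResumeTimer` is a hypothetical helper name, not part of the library. It records the wall-clock gap between the end of a validation run and the start of the next training batch, and tags whether the validation was the sanity check.

```python
import time

try:
    from lightning.pytorch.callbacks import Callback
except ImportError:
    # Fallback so the sketch still runs where Lightning is not installed.
    Callback = object


class ResumeTimer(Callback):
    """Hypothetical helper: time from validation end to the next training batch."""

    def __init__(self):
        super().__init__()
        self.validation_ended_at = None
        self.was_sanity_check = False
        self.gaps = []  # list of (was_sanity_check, seconds)

    def on_validation_end(self, trainer, pl_module):
        # trainer.sanity_checking is True only during the sanity-check pass.
        self.was_sanity_check = bool(getattr(trainer, "sanity_checking", False))
        self.validation_ended_at = time.perf_counter()

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        if self.validation_ended_at is None:
            return  # no validation run since the last recorded gap
        gap = time.perf_counter() - self.validation_ended_at
        self.gaps.append((self.was_sanity_check, gap))
        self.validation_ended_at = None
```

Passing an instance via `Trainer(callbacks=[ResumeTimer()])` and printing `gaps` after a few validation cycles should show how Step 1 to 2 compares with Step 3 to 4. The measured gap includes everything that runs between the two hooks (for example checkpoint saving or dataloader re-creation), and GPU work is asynchronous, so treat the numbers as a rough indicator rather than a precise profile.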