It takes a lot of time to resume training after running evaluation [FSDP]. What can be the reason behind this? #20257

psr-ai · 2024-09-06T22:17:36Z

My training process runs as follows:

----- 1. Sanity Check Validation ------
----- 2. Training ---------------------
----- 3. Validation -------------------
----- 4. Training ---------------------

The transition from the sanity-check validation to training (step 1 to 2) is almost instant, but the transition from validation back to training (step 3 to 4) takes a long time. I am training a large language model with the FSDP strategy. What could be the reason?
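
To put a number on the delay, the gap between the end of a validation run and the first training batch after it can be timed directly. The following is a minimal sketch, not a diagnosis. It assumes the loop is driven by PyTorch Lightning's Trainer (the post does not say so explicitly), and the class name `TransitionTimer` is hypothetical. The hook names and argument lists mirror Lightning's `Callback` hooks `on_validation_end` and `on_train_batch_start`; the class is written without importing Lightning so that it runs standalone, and for real use it would subclass `lightning.pytorch.Callback` and be passed via `Trainer(callbacks=[...])`.

```python
import time


class TransitionTimer:
    """Times each validation -> training transition (wall clock, in seconds).

    Assumption: hook names and signatures follow PyTorch Lightning's
    Callback API. For real use, subclass lightning.pytorch.Callback.
    """

    def __init__(self, clock=time.perf_counter):
        self._clock = clock                # injectable so the logic can be tested
        self._validation_ended_at = None   # timestamp of the last validation end
        self.gaps = []                     # one entry per timed transition

    def on_validation_end(self, trainer, pl_module):
        # To my knowledge this hook also fires after the sanity check,
        # so the step 1 -> 2 transition is recorded as the first entry.
        self._validation_ended_at = self._clock()

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        # Only the first training batch after a validation run is of interest.
        if self._validation_ended_at is None:
            return
        self.gaps.append(self._clock() - self._validation_ended_at)
        self._validation_ended_at = None
```

After a run, `gaps[0]` would correspond to the step 1 to 2 transition and `gaps[1]` to the step 3 to 4 transition. The measured interval covers everything that happens between the two hooks on the host, so it shows how long the stall is but not what causes it.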

Replies: 0 comments
