Add support for seed checkpoint creation for meta-init flow #172
Conversation
seed_checkpoint.py (Outdated)
tokenizer = create_tokenizer(tokenizer_type, job_config.model.tokenizer_path)

# build model (using meta init)
model_cls = model_name_to_cls[model_name]
I'm wondering whether we could run some benchmarks on the time needed to start training from the seed checkpoint? I'm a bit worried that if that takes very long, this approach might not be our desired solution.
Yeah, it's a good point. What configuration should I benchmark? I have only been running tiny models.
But any 'real' training run should expect to save and load checkpoints periodically due to faults, so I think we need checkpoint loads fast enough to live with during training anyway.
It's probably a bigger deal for UX during iterative development, but for small models it is not a noticeable amount of time.
The only workaround I can think of is to write a custom 'initializer' function for our model that we can call safely on a post-PP-split model chunk. This is complex/ugly but should be fast, and it only helps the first launch of training, not the checkpoint resume.
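To put a number on it, the thing to time is probably just the DCP load into a meta-initialized model. A rough sketch of how I'd measure it (the torch.distributed.checkpoint usage and the names here are my assumption, not code from this PR):

```python
# Rough benchmark sketch (not part of this PR): time how long it takes to load
# a seed checkpoint into a meta-initialized model via torch.distributed.checkpoint.
import time

import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemReader


def time_seed_checkpoint_load(model: torch.nn.Module, ckpt_dir: str) -> float:
    # Materialize the meta-device parameters as empty tensors on the target device.
    model.to_empty(device="cuda")

    # DCP loads in place into the tensors referenced by this state_dict.
    state_dict = {"model": model.state_dict()}

    start = time.perf_counter()
    dcp.load_state_dict(state_dict, storage_reader=FileSystemReader(ckpt_dir))
    return time.perf_counter() - start
```

Presumably the number that matters is how this scales from the tiny debug models up to 13B/70B.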
Sorry, just saw this! Ideally I think we'd benchmark a 13B/70B model loading its seed checkpoint.
The point that real training loads checkpoints on restart anyway makes sense to me, so I guess we can live with the seed checkpoint if that's the best UX.
For the workaround, IIUC pipeline splitting would only need to check that: 1. the first embedding layer exists, 2. the last projection layer exists. For the TransformerBlock module list it doesn't need to check anything since it's a for loop anyway?
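Roughly I'm imagining something like this (just a sketch of the idea; the tok_embeddings / layers / output attribute names and the init choices are assumptions about the model class, not code from this PR):

```python
# Hypothetical chunk-safe initializer: only touch the submodules that actually
# exist on this pipeline stage's model chunk.
# (Assumes the chunk has already been materialized off the meta device.)
import torch.nn as nn


def init_weights_on_chunk(chunk: nn.Module) -> None:
    # 1. first embedding layer, only if this stage has it
    if getattr(chunk, "tok_embeddings", None) is not None:
        nn.init.normal_(chunk.tok_embeddings.weight)

    # 2. transformer blocks: just loop over whatever blocks this stage owns
    for block in getattr(chunk, "layers", []):
        block.init_weights()  # assumes each block can init itself

    # 3. last projection layer, only if this stage has it
    if getattr(chunk, "output", None) is not None:
        nn.init.trunc_normal_(chunk.output.weight, mean=0.0, std=0.02)
```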
LGTM!
Thanks for doing the foundational work!
create_seed_checkpoint.sh (Outdated)
torchrun --nproc_per_node=${NGPU} --rdzv_backend c10d --rdzv_endpoint="localhost:0" \
    --local-ranks-filter ${LOG_RANK} --role rank --tee 3 \
    train.py --job.config_file ${CONFIG_FILE} $seed_checkpoint $overrides
If NGPU is always 1 for creating the seed checkpoint, shall we just launch the script with plain python train.py?
Not sure if it would work directly though.
It doesn't work because the script still expects things like WORLD_SIZE to be set.
I considered just hardcoding the envs inside the launcher, but why not just keep it simple.
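For reference, "hardcoding the envs" would have meant something like this before invoking train.py (a sketch using the standard torch.distributed environment variables; not what the script actually does):

```python
# Hypothetical alternative (not used): fake the torchrun environment so that a
# bare `python train.py` run works for the single-rank seed-checkpoint case.
import os

os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
```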
Looks good, I have some questions and suggestions inlined.
if job_config.checkpoint.create_seed_checkpoint:
    assert (
        world_size == 1
I have some questions about the meta-init workflow. Assuming we are planning to train a 70B model with PP, but initially we want to create a seed checkpoint so that we can use it later for the meta-init load: does that mean we need to init the 70B model on CPU first? If we init a debug model and save its seed checkpoint, I guess that won't be reusable for the later 70B model?
The seed checkpoint must match the model we're training.
I suppose I haven't even tested seed creation on CPU vs GPU; probably we need to expose more options or smarts to determine which device to use here.
Does that mean we need to init the 70B model on CPU first?
Yes, that's right. If this isn't OK, I think we can try to hand-write some initializer functions that work with the PP-traced module, but maybe that can come later?
Note: even if we say "ditch the pipeline tracer", it won't fix the initializer problem. We'd still need to customize the model's init_weights functions so that they work given a 'model chunk' instead of a whole model, and then we'd also have to tolerate some RNG divergence from the non-PP case, or implement RNG seed passing and coordination.
Yeah, that can come later. My main worry is that for larger models like 70B/400B this is going to be quite an initialization bottleneck, since we need to:
- create the 70B/400B model on CPU and save it to disk, which probably takes a couple of minutes
- trace the 70B/400B model, which takes another couple of minutes
- load the checkpoint, which takes another couple of minutes (this is fine I think, since training always wants to save/load checkpoints anyway)
Maybe we should brainstorm more on how PP could support meta init in a more sound way, e.g. a PipelineStage.init_weights that waits for its RNG (if not the first stage), inits the current meta model's weights, and transmits the RNG to the next stage.
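Very roughly, something like this (pure sketch of the idea; PipelineStage has no such API today, and the rank-to-stage mapping, device, and backend choices below are all assumptions):

```python
# Hypothetical sketch of the RNG hand-off: each stage waits for the RNG state
# from the previous stage, initializes its (meta) model chunk, then forwards
# the advanced RNG state to the next stage. Assumes rank == stage index and a
# backend (e.g. gloo) that can send/recv CPU byte tensors.
import torch
import torch.distributed as dist


def stage_init_weights(stage_module: torch.nn.Module, stage_idx: int, num_stages: int) -> None:
    if stage_idx > 0:
        # wait for the RNG state from the previous stage
        rng_state = torch.empty_like(torch.cuda.get_rng_state())
        dist.recv(rng_state, src=stage_idx - 1)
        torch.cuda.set_rng_state(rng_state)

    # materialize and initialize this stage's chunk of the model
    stage_module.to_empty(device="cuda")
    stage_module.init_weights()  # assumes a chunk-safe init_weights exists

    if stage_idx < num_stages - 1:
        # transmit the (now advanced) RNG state to the next stage
        dist.send(torch.cuda.get_rng_state(), dst=stage_idx + 1)
```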
- trace the 70B/400B model, which takes another couple of minutes

Do you think the tracing time will be long? I actually assumed tracing time would not be an issue; it should depend more on the number of operators than on the parameter size.

Maybe we should brainstorm more on how PP could support meta init in a more sound way

I am kind of ambivalent about this. Yes, I like the idea; I even proposed it at one point. But on the other hand I see quite a few more important issues to solve first, and this idea is quite complicated, so it needs to be justified. So this seems like a P1 to me.
create_seed_checkpoint.sh (Outdated)
torchrun --nproc_per_node=${NGPU} --rdzv_backend c10d --rdzv_endpoint="localhost:0" \
    --local-ranks-filter ${LOG_RANK} --role rank --tee 3 \
    train.py --job.config_file ${CONFIG_FILE} $seed_checkpoint $overrides
Given that the seed checkpoint requires no parallelisms, we should just provide the overrides here (e.g. training.dp_degree=1) to disable all parallelisms, since we already specify NGPU=1.
Yeah, that's a good point.
I didn't want to hardcode those kinds of things inside train.py, but in this script it seems exactly right to do this. I'll change it.
# All rights reserved.

# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
Can we add some comments at the beginning of the file to explain what this script is used for and how to run it?
Stack from ghstack (oldest at bottom):
Adds a new command, ./create_seed_checkpoint.sh, which largely reuses code inside train.py to create the model and then save its initial state as a step-0 checkpoint for use with the meta-initialization loading flow.