Enable optional checkpoint at loading #819

mori360 · 2025-02-04T22:59:52Z

Add the argument "--checkpoint.exclude" to let users exclude specific keys from being loaded from the checkpoint.

If checkpoint.exclude contains "dataloader", users can load with a different dp_degree without resharding, because the dataloader state is excluded.
If checkpoint.exclude contains "lr_scheduler", the lr_scheduler starts counting from step 0.
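
A minimal sketch of the idea, assuming the checkpointed state is a plain dict keyed by component name. The keys and string values below are illustrative placeholders, not torchtitan's actual state objects:

```python
# Hypothetical state dict: real entries would be stateful objects, not strings.
states = {
    "model": "model-state",
    "optimizer": "optimizer-state",
    "lr_scheduler": "lr-scheduler-state",
    "dataloader": "dataloader-state",
}

# Keys the user asked to skip when loading.
exclude = ["dataloader", "lr_scheduler"]

# Only the remaining entries would be handed to the checkpoint loader.
states_to_load = {k: v for k, v in states.items() if k not in exclude}
print(sorted(states_to_load))
```

Excluded components keep whatever state they were freshly initialized with, which is why an excluded lr_scheduler would restart from step 0.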

mori360 · 2025-02-04T23:03:25Z

torchtitan/checkpoint.py

@@ -169,12 +169,6 @@ def __init__(
            into one state dict before saving/loading. We rely on the individual state_dicts to not collide,
            which is gauranteed for the model by correct pipeline splitting and for the optimizer by the flattening
            support described in (1).
-
-        3. LR schedulers also index model states like optimizers and would need to be flattened properly to support


lr_scheduler flatten at #794

We should add a comment here to say the lr_scheduler resharding assumes that all lr_schedulers are the same.

mori360 · 2025-02-04T23:04:32Z

torchtitan/config_manager.py

@@ -511,6 +511,16 @@ def __init__(self):
            default=-1,
            help="Load the checkpoint at the specified step. If -1, load the latest checkpoint.",
        )
+        self.parser.add_argument(


Currently checkpoint.exclude only supports excluding at loading. Shall we use an argument name like exclude_from_loading?

yes, exclude_from_loading is more explicit.

mori360 · 2025-02-04T23:49:10Z

tests/integration_tests.py

+            ],
+            "Optional checkpoint",
+            "optional_checkpoint",
+        ),


Add an integration test here, in particular to show that excluding the dataloader from loading avoids the dp_degree mismatch error when dp_degree differs before and after the checkpoint.

fegin

LGTM, but please change the comments

fegin · 2025-02-05T01:42:46Z

torchtitan/checkpoint.py

@@ -169,12 +169,6 @@ def __init__(
            into one state dict before saving/loading. We rely on the individual state_dicts to not collide,
            which is gauranteed for the model by correct pipeline splitting and for the optimizer by the flattening
            support described in (1).
-
-        3. LR schedulers also index model states like optimizers and would need to be flattened properly to support


We should add a comment here to say the lr_scheduler resharding assumes that all lr_schedulers are the same.

fegin · 2025-02-05T01:45:37Z

torchtitan/checkpoint.py

+        shadow_states = {k: v for k, v in states.items() if k not in self.exclude}
+        for exclude_key in self.exclude:
+            if exclude_key not in states:
+                logger.warning(f"{exclude_key} not found in state_dict, skipping")


We should just raise an exception. A better way to do this is:

if not set(self.exclude).issubset(set(states.keys())): raise ValueError("...")
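
A sketch of that validation combined with the filtering step. `filter_states` is a hypothetical helper name, and the check is written as a set difference (equivalent to the `issubset` test) so the error message can name the offending keys:

```python
def filter_states(states: dict, exclude: list) -> dict:
    """Return the entries of `states` to load, failing on unknown exclude keys."""
    unknown = set(exclude) - set(states)
    if unknown:
        # Fail loudly instead of silently ignoring a misspelled key.
        raise ValueError(f"Keys not found in state_dict: {sorted(unknown)}")
    return {k: v for k, v in states.items() if k not in exclude}
```

With this shape, a typo such as "dataloder" stops the load immediately rather than loading the dataloader state the user meant to skip.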

fegin · 2025-02-05T01:46:06Z

torchtitan/config_manager.py

@@ -511,6 +511,16 @@ def __init__(self):
            default=-1,
            help="Load the checkpoint at the specified step. If -1, load the latest checkpoint.",
        )
+        self.parser.add_argument(


yes, exclude_from_loading is more explicit.

tianyu-l

In general looks good. Had several comments on details.
Plus we need to document the usage in https://github.com/pytorch/torchtitan/blob/main/docs/checkpoint.md
including the proper use cases mentioned in #809 (comment)

tianyu-l · 2025-02-05T04:16:57Z

torchtitan/checkpoint.py

-        optimizers do, so it's hard to write a generic 'flattener' utility.
-
-            TODO: This is currently unsolved and needs a fix.
+        3. LR schedulers also index model states like optimizers. Here we flatten the lr_schedulers by the ssumption that


Suggested change

3. LR schedulers also index model states like optimizers. Here we flatten the lr_schedulers by the ssumption that

3. LR schedulers also index model states like optimizers. Here we flatten the lr_schedulers with the assumption that

tianyu-l · 2025-02-05T04:20:42Z

torchtitan/checkpoint.py

@@ -203,6 +200,11 @@ def __init__(

        self.model_weights_only = ckpt_config.model_weights_only
        self.export_dtype = TORCH_DTYPE_MAP[ckpt_config.export_dtype]
+        self.exclude_from_loading = (
+            [item.strip() for item in ckpt_config.exclude_from_loading]


can we do this strip in the definition of string_list?

tianyu-l · 2025-02-05T04:23:44Z

torchtitan/checkpoint.py

+        self.exclude_from_loading = (
+            [item.strip() for item in ckpt_config.exclude_from_loading]
+            if ckpt_config.exclude_from_loading
+            else []


why this branch? Isn't it already a list after split in string_list?

tianyu-l · 2025-02-05T04:25:19Z

torchtitan/config_manager.py

+        self.parser.add_argument(
+            "--checkpoint.exclude_from_loading",
+            type=string_list,
+            default="",


The default should be []? as in

torchtitan/torchtitan/config_manager.py

Line 305 in 690f299

default=[],

If the default is "", you'll always end up with [""] after the split in string_list.
See https://docs.python.org/3.3/library/stdtypes.html
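
The behavior in question can be checked directly; this snippet only uses the built-in `str.split`:

```python
# Splitting an empty string on a separator yields one empty item, not an empty list.
empty_result = "".split(",")
normal_result = "optimizer,dataloader".split(",")

print(empty_result)
print(normal_result)
```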

tianyu-l · 2025-02-05T04:29:24Z

tests/integration_tests.py

@@ -418,6 +418,22 @@ def build_test_list():
            "test_generate",
            ngpu=2,
        ),
+        OverrideDefinitions(


There are two tests missing:

1. Unit tests (on CPU). Passing the test here doesn't mean the behavior is correct. We should add a unit test similar to https://github.com/pytorch/torchtitan/blob/690f299d37c5f6d34273762c0d650888a754d3c0/tests/unit_tests/test_dataset_checkpointing.py

2. The test here only covers the command-line arg override, but it could be problematic if the option is specified in toml. We need to add a test similar to

torchtitan/tests/unit_tests/test_job_config.py

Line 48 in 690f299

def test_parse_pp_split_points(self):

Add test here with comments. In the optional checkpoint test, we save at [dp:4] and load at [dp:2, tp:2]; the dataloader should be excluded from loading, otherwise a dp_degree mismatch error is raised.

tianyu-l · 2025-02-05T04:31:37Z

torchtitan/checkpoint.py

+            k: v for k, v in states.items() if k not in self.exclude_from_loading
+        }
+        for exclude_key in self.exclude_from_loading:
+            if exclude_key != "" and exclude_key not in states:


we should filter "" (and any empty space) out in string_list
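
One way to do that, as a sketch. The body below is an assumption about how `string_list` could be written, not the code merged in this PR:

```python
def string_list(raw_arg: str) -> list:
    """Split a comma-separated argument, dropping whitespace and empty items."""
    return [item.strip() for item in raw_arg.split(",") if item.strip()]
```

With the filtering done here, callers no longer need an `exclude_key != ""` guard or their own `.strip()` pass.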

tianyu-l · 2025-02-05T04:32:31Z

torchtitan/checkpoint.py

+        }
+        for exclude_key in self.exclude_from_loading:
+            if exclude_key != "" and exclude_key not in states:
+                raise ValueError(f"{exclude_key} not found in state_dict, skipping")


What do you mean by "skipping" when you raise an exception? Technically it should be "failing".

tianyu-l · 2025-02-05T04:33:43Z

torchtitan/checkpoint.py

@@ -435,10 +437,17 @@ def load(self, step: int = -1) -> bool:
        }
        logger.info(f"Loading the checkpoint at step {step}.")
        begin = time.monotonic()
+        shadow_states = {


can you explain more about the naming? I'd call it states_to_load

tianyu-l · 2025-02-05T04:34:53Z

torchtitan/config_manager.py

@@ -665,6 +682,9 @@ def parse_args_from_command_line(
                # since the inferred type is just 'list' and it ends up flattening
                # e.g. from ["layers.0", "layers.1"] into ["l", "a", "y", "e", "r", "s", ".0", ...]
                aux_parser.add_argument("--" + arg, type=string_list)
+            elif arg == "checkpoint.exclude_from_loading":
+                # same as above for checkpoint.exclude_from_loading


Suggested change

# same as above for checkpoint.exclude_from_loading

# similar to the case above
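
The flattening that the code comment warns about can be reproduced with plain argparse. This is a sketch: `--arg` is a hypothetical flag, and `string_list` here is a simplified stand-in for the project's helper:

```python
import argparse


def string_list(raw_arg: str) -> list:
    # Simplified stand-in: split a comma-separated string into items.
    return raw_arg.split(",")


# With type=list, argparse calls list() on the raw string, yielding characters.
naive = argparse.ArgumentParser()
naive.add_argument("--arg", type=list)
chars = naive.parse_args(["--arg", "ab,c"]).arg

# With type=string_list, the string is split into the intended items.
fixed = argparse.ArgumentParser()
fixed.add_argument("--arg", type=string_list)
items = fixed.parse_args(["--arg", "ab,c"]).arg
```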

fegin · 2025-02-05T21:39:30Z

Can you publish the PR?

tianyu-l

lgtm, thank you!
please address remaining comments before merging.

tianyu-l · 2025-02-06T22:14:52Z

torchtitan/checkpoint.py

@@ -435,10 +433,17 @@ def load(self, step: int = -1) -> bool:
        }
        logger.info(f"Loading the checkpoint at step {step}.")
        begin = time.monotonic()
+        state_to_load = {


Suggested change

state_to_load = {

states_to_load = {

tianyu-l · 2025-02-06T22:18:31Z

torchtitan/config_manager.py

@@ -511,6 +511,17 @@ def __init__(self):
            default=-1,
            help="Load the checkpoint at the specified step. If -1, load the latest checkpoint.",
        )
+        self.parser.add_argument(
+            "--checkpoint.exclude_from_loading",
+            type=string_list,


shall we still do .strip and empty check in string_list?

tianyu-l · 2025-02-06T22:21:57Z

tests/unit_tests/test_job_config.py

+        toml_splits = ["optimizer", "lr_scheduler", "dataloader"]
+        toml_split_str = ",".join(toml_splits)
+        cmdline_splits = ["optimizer", "lr_scheduler", "dataloader"]
+        cmdline_split_str = ",".join(cmdline_splits)


the point of having two sets is that we can test override, e.g. in "toml has split points, cmdline overrides them".
So we need to make them different to test robustness.
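
The override case could be exercised along these lines. This is a sketch built on plain argparse: `resolve_exclude` is a hypothetical helper standing in for the real config parsing, and the toml value is passed in as an already-parsed list:

```python
import argparse


def string_list(raw_arg: str) -> list:
    # Split on commas, dropping whitespace and empty items.
    return [item.strip() for item in raw_arg.split(",") if item.strip()]


def resolve_exclude(toml_value: list, cmdline_args: list) -> list:
    """Hypothetical helper: the toml value is the default, the command line overrides it."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--checkpoint.exclude_from_loading",
        type=string_list,
        default=toml_value,
    )
    args = parser.parse_args(cmdline_args)
    # The dest keeps the dot, so attribute access needs getattr.
    return getattr(args, "checkpoint.exclude_from_loading")


toml_splits = ["optimizer", "lr_scheduler"]
cmdline_splits = ["dataloader"]  # deliberately different from the toml value

from_toml_only = resolve_exclude(toml_splits, [])
overridden = resolve_exclude(
    toml_splits,
    ["--checkpoint.exclude_from_loading", ",".join(cmdline_splits)],
)
```

Because the two lists differ, an assertion on `overridden` can only pass if the command line actually took precedence over the toml value.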

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Feb 4, 2025

enable optional checkpoint at loading

58466d5

mori360 force-pushed the optional_checkpoint branch from 4491e62 to 58466d5 Compare February 4, 2025 23:02

mori360 commented Feb 4, 2025

add unit test

759c545

mori360 commented Feb 4, 2025

mori360 requested review from tianyu-l and fegin February 5, 2025 00:29

mori360 marked this pull request as ready for review February 5, 2025 00:29

fegin reviewed Feb 5, 2025

fegin mentioned this pull request Feb 5, 2025

FSDP checkpoints don't load when run is restarted with greater world size #811

Closed

mori360 added 2 commits February 4, 2025 19:06

change argument name, add runtimeerror

47f914a

add corner case

673013b

tianyu-l requested changes Feb 5, 2025

tianyu-l linked an issue Feb 5, 2025 that may be closed by this pull request

FSDP checkpoints don't load when run is restarted with greater world size #811

Closed

mori360 marked this pull request as draft February 5, 2025 05:09

mori360 added 4 commits February 4, 2025 21:09

change argument default value

7418f60

integration_tests

a5c0006

add doc

8e31858

add nargs

2fb6f55

mori360 marked this pull request as ready for review February 5, 2025 21:40

tianyu-l mentioned this pull request Feb 6, 2025

Question about Project Status and Potential Contributions fla-org/flame#1

Open

mori360 added 2 commits February 6, 2025 09:31

add comment

c3d2370

update comments

096d506

mori360 requested a review from tianyu-l February 6, 2025 18:30

tianyu-l approved these changes Feb 6, 2025

mori360 added 2 commits February 6, 2025 14:37

add string_list process with stripe and empty check

b1f1d5d

Merge branch 'main' into optional_checkpoint

582fe7d

mori360 merged commit 49c6d6f into pytorch:main Feb 7, 2025
6 checks passed
