torch.compile each TransformerBlock instead of the whole model #268
Conversation
Sounds good to me!
This way we could temporarily enable 2-D parallel compile, and it makes more sense to do transformer block compile in the future with PP anyway. We should figure out the dynamic shape issue, though.
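For context, a minimal sketch (not the exact PR diff) of what per-TransformerBlock compilation could look like, assuming a Llama-style model that keeps its blocks in `model.layers`; the helper name is made up for illustration:

```python
import torch
import torch.nn as nn


def compile_transformer_blocks(model: nn.Module) -> nn.Module:
    """Compile each TransformerBlock individually instead of the whole model."""
    for layer_id, transformer_block in model.layers.named_children():
        # Compiling block-by-block keeps graph breaks (e.g. from parallelism
        # wrappers) outside the per-block graphs instead of fragmenting one
        # big whole-model graph.
        compiled_block = torch.compile(transformer_block)
        model.layers.register_module(layer_id, compiled_block)
    return model
```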
lgtm! Some nit comments on how to organize things.
train.py
Outdated
    if job_config.training.compile:
        if (
            job_config.activation_checkpoint.mode == "selective"
            and job_config.activation_checkpoint.selective_ac_option == "op"
        ):
            # some flags for torch.compile enablement
            torch._dynamo.config._experimental_support_context_fn_in_torch_utils_checkpoint = (
                True
            )
        logger.info("Compiling model with torch.compile")
        model = torch.compile(model)
        logger.info("Compiling each TransformerBlock with torch.compile")
nit: Since we put parallelization, AC, and compile all in parallelize_llama.py, it might make more sense to put this all in that file. This will simplify train.py.
Concretely, we can move this block to parallelize_llama.py, right before logger.info("Applied FSDP to the model"), and change the wording from "Compiling each ..." to "Compiled each ...".
train.py
Outdated
@@ -219,17 +219,16 @@ def loss_fn(pred, labels):

    metric_logger = build_metric_logger(job_config)

    # torch.compile model for improved performance
    if job_config.training.compile:
nit: since we are moving compilation to parallelize_llama, let's add a comment on L201.
logger.info(f"Applied {ac_mode} activation checkpointing to the model") | ||
|
||
if job_config.training.compile: | ||
# turn on per-transformer block compile after AC wrappnig and before FSDP |
typo: wrappnig -> wrapping
        torch._dynamo.config._experimental_support_context_fn_in_torch_utils_checkpoint = (
            True
        )
    if enable_compile:
nit: Can this error check be moved up to line 216?
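For reference, a sketch of how that experimental flag and the per-block compile fit together (`enable_compile` and the AC options come from the snippets above; the `ac_config` name and exact structure are assumptions):

```python
# The experimental dynamo flag is only needed when "op"-level selective AC is
# used, since that path relies on a context_fn in torch.utils.checkpoint which
# (judging by the flag's name) torch.compile does not support without it.
if enable_compile:
    if ac_config.mode == "selective" and ac_config.selective_ac_option == "op":
        torch._dynamo.config._experimental_support_context_fn_in_torch_utils_checkpoint = True
    transformer_block = torch.compile(transformer_block, dynamic=False)
```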
Should we be able to close #61 after this PR? Also, do we need to run end-to-end numerics testing?
I'll break up some other changes to land them first.
When PP is present, we may torch.compile the whole stage module, which is bigger than a transformer block.
It would also allow the code to be more model-agnostic -- there is no ...
In my case, enabling compilation in this way (per-layer) causes a memory leak.
🤔 Interesting, how did you observe that? FWIW this doesn't work out of the box, as it triggers some non-trivial numeric issues; I'm going to leave this PR here until I resolve them. Opening a new PR to turn dynamic shapes off so that it works for both 1D and 2D compile.
With each iteration, the memory usage increases and eventually results in OOM. However, just to be clear, I haven't tested this on your entire code, only on a part of it. Adding per-layer compilation causes a memory leak with each iteration. I know that the memory leak might be related to my implementation, so I just wanted to bring this issue to your attention. If you don't observe this in your code, then it's likely an issue on my end.
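(For anyone trying to reproduce this, one way to watch for per-iteration growth is to log allocator stats after each step; the helper below is illustrative and not part of the repo:)

```python
import torch
from typing import Callable


def log_cuda_memory(train_step: Callable[[], None], num_steps: int = 20) -> None:
    """Run `train_step` repeatedly and print CUDA memory stats after each step.

    A steady climb in allocated memory across iterations (rather than a plateau
    after the first few warm-up steps) points at a leak.
    """
    for step in range(num_steps):
        train_step()
        torch.cuda.synchronize()
        alloc_gb = torch.cuda.memory_allocated() / 1e9
        reserved_gb = torch.cuda.memory_reserved() / 1e9
        print(f"step {step}: allocated={alloc_gb:.2f} GB, reserved={reserved_gb:.2f} GB")
```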
    if enable_compile:
        # turn on per-transformer block compile after AC wrapping and before FSDP
        # TODO: dynamic shapes have some issues so we turn them off for now.
        transformer_block = torch.compile(transformer_block, dynamic=False)
Curious how far we are from being able to enable fullgraph=True?
I think fullgraph=True should work already when we move to per-TransformerBlock compile; maybe I can add that flag too.
I ended up just turning dynamic=False in the current full-model compile (#297). In this case we can't set fullgraph=True yet, as FSDP still graph-breaks, but for that case we should already be capturing each TransformerBlock as a full graph.
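So the per-block call ends up roughly like this, with `fullgraph=True` as a possible follow-up rather than part of this PR:

```python
# what this PR does: compile each block with dynamic shapes disabled
transformer_block = torch.compile(transformer_block, dynamic=False)

# possible follow-up once per-block graphs are confirmed to be break-free:
# transformer_block = torch.compile(transformer_block, dynamic=False, fullgraph=True)
```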
going to merge this given that:
This way we could temporarily enable 2-D parallel compile, and it might make sense to do transformer block compile in the future with PP (which we'll see). We should figure out:
1. the dynamic shape issue when turning on 2-D parallel
2. the full-model compile issue for 2-D parallel compile
3. cache reuse, which currently does not work; enable it later