
what is gradient_checkpointing_block_size ? #446

Closed
tjingrant opened this issue Feb 2, 2024 · 1 comment

@tjingrant

What does this knob do?
I couldn't find where it is used, but it appears in many config files:

gradient_checkpointing_block_size: int = 5

It would be nice to have documentation for this.

@dlwh
Member

dlwh commented Feb 2, 2024

It's vestigial and doesn't do anything at the moment, but it should be brought back... #427 has the beginnings of an investigation into bringing it back.

At the moment we remember the outputs of all num_layers layers (but not the attention matrices or MLP intermediates), which makes rematerializing the information needed for the backward pass take O(num_layers) time, which is good. But you can actually still get O(num_layers) time while remembering only O(sqrt(num_layers)) activations, which could be a huge memory saving in a big model. The block size refers to the number of layers that are grouped together for this caching.

I didn't explain it very well... here's a longer explanation https://github.com/cybertronai/gradient-checkpointing
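
For a rough illustration of the block-wise idea in code, here is a minimal JAX sketch (not Levanter's actual implementation; the toy layer stack, `apply_block`, `block_size`, and `hidden` are made up for the example). `jax.checkpoint` keeps only the block-boundary activations and recomputes everything inside a block during the backward pass:

```python
# Minimal sketch of block-wise activation checkpointing, assuming a toy
# stack of identical tanh layers. Names here are illustrative only.
import jax
import jax.numpy as jnp

num_layers = 12
block_size = 4          # analogous to gradient_checkpointing_block_size
hidden = 64

def layer(params, x):
    w, b = params
    return jnp.tanh(x @ w + b)

def init_params(key):
    keys = jax.random.split(key, num_layers)
    return [(jax.random.normal(k, (hidden, hidden)) / jnp.sqrt(hidden),
             jnp.zeros(hidden)) for k in keys]

def apply_block(block_params, x):
    # Run block_size layers; activations inside the block are recomputed
    # in the backward pass rather than stored.
    for p in block_params:
        x = layer(p, x)
    return x

# jax.checkpoint (a.k.a. jax.remat) saves only the block-boundary activations.
checkpointed_block = jax.checkpoint(apply_block)

def forward(params, x):
    for i in range(0, num_layers, block_size):
        x = checkpointed_block(params[i:i + block_size], x)
    return x

def loss(params, x):
    return jnp.mean(forward(params, x) ** 2)

key = jax.random.PRNGKey(0)
params = init_params(key)
x = jax.random.normal(key, (8, hidden))
# Stores roughly num_layers / block_size boundary activations instead of num_layers.
grads = jax.jit(jax.grad(loss))(params, x)
```

With block_size ≈ sqrt(num_layers), you keep about sqrt(num_layers) boundary activations plus at most one block's worth during recomputation, while each layer is only re-run once, so the backward pass stays O(num_layers) time.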

I'm gonna close this in favor of #427

dlwh closed this as completed Feb 2, 2024