
what is gradient_checkpointing_block_size ? #446

Closed
tjingrant opened this issue Feb 2, 2024 · 1 comment

@tjingrant

What does this knob do?
I couldn't find where it is used, but it appears in many config files:

gradient_checkpointing_block_size: int = 5

It would be nice to have documentation for this.

@dlwh
Member

dlwh commented Feb 2, 2024

It's vestigial and doesn't do anything at the moment, but it should be brought back... #427 has the beginnings of an investigation into bringing it back.

At the moment we remember the outputs of all num_layers layers (but not the attention matrices or MLP intermediates), which makes rematerializing the information needed for the backward pass take O(num_layers) time, which is good. But you can actually still get O(num_layers) time while remembering only O(sqrt(num_layers)) activations, which could be a huge memory saving in a big model. The block size refers to the number of layers that are grouped together for this caching.

I didn't explain it very well... here's a longer explanation https://github.com/cybertronai/gradient-checkpointing
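
For a rough illustration of the block-wise idea in code, here is a minimal JAX sketch (not Levanter's actual implementation; the toy layer stack, `apply_block`, `block_size`, and `hidden` are made up for the example). `jax.checkpoint` keeps only the block-boundary activations and recomputes everything inside a block during the backward pass:

```python
# Minimal sketch of block-wise activation checkpointing, assuming a toy
# stack of identical tanh layers. Names here are illustrative only.
import jax
import jax.numpy as jnp

num_layers = 12
block_size = 4          # analogous to gradient_checkpointing_block_size
hidden = 64

def layer(params, x):
    w, b = params
    return jnp.tanh(x @ w + b)

def init_params(key):
    keys = jax.random.split(key, num_layers)
    return [(jax.random.normal(k, (hidden, hidden)) / jnp.sqrt(hidden),
             jnp.zeros(hidden)) for k in keys]

def apply_block(block_params, x):
    # Run block_size layers; activations inside the block are recomputed
    # in the backward pass rather than stored.
    for p in block_params:
        x = layer(p, x)
    return x

# jax.checkpoint (a.k.a. jax.remat) saves only the block-boundary activations.
checkpointed_block = jax.checkpoint(apply_block)

def forward(params, x):
    for i in range(0, num_layers, block_size):
        x = checkpointed_block(params[i:i + block_size], x)
    return x

def loss(params, x):
    return jnp.mean(forward(params, x) ** 2)

key = jax.random.PRNGKey(0)
params = init_params(key)
x = jax.random.normal(key, (8, hidden))
# Stores roughly num_layers / block_size boundary activations instead of num_layers.
grads = jax.jit(jax.grad(loss))(params, x)
```

With block_size ≈ sqrt(num_layers), you keep about sqrt(num_layers) boundary activations plus at most one block's worth during recomputation, while each layer is only re-run once, so the backward pass stays O(num_layers) time.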

I'm gonna close this in favor of #427

dlwh closed this as completed Feb 2, 2024