It's vestigial and doesn't do anything at the moment, but it should also be brought back... #427 has the beginnings of an investigation into bringing it back.
At the moment we remember the outputs of all num_layers layers (but not the attention matrices or MLP intermediates), which makes rematerializing the information needed for the backward pass take O(num_layers) time, which is good. But you can actually still get O(num_layers) time while remembering only O(sqrt(num_layers)) activations, which could be a huge memory savings in a big model. The block size refers to the number of layers that should be grouped together for this caching.
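A minimal sketch of that blockwise checkpointing idea in JAX, just to make the memory/recompute trade-off concrete. The layer function and parameter layout here are toy placeholders, not Levanter's actual implementation; only block boundaries are saved, and everything inside a block is recomputed during backprop via `jax.checkpoint` (a.k.a. `jax.remat`):

```python
import math
import jax
import jax.numpy as jnp


def apply_layer(params, x):
    # Stand-in for a transformer layer (real layers have attention + MLP).
    return jnp.tanh(x @ params)


def forward(all_layer_params, x, block_size):
    """Run the stack, checkpointing only at block boundaries.

    Only the input to each block of `block_size` layers is kept for the
    backward pass; activations inside a block are recomputed. With
    block_size ~ sqrt(num_layers), that's O(sqrt(num_layers)) saved
    activations while total recompute stays O(num_layers).
    """
    num_layers = len(all_layer_params)

    def run_block(block_params, h):
        for p in block_params:
            h = apply_layer(p, h)
        return h

    # jax.checkpoint discards intermediates inside the wrapped function
    # and rematerializes them on the backward pass.
    run_block_ckpt = jax.checkpoint(run_block)

    for start in range(0, num_layers, block_size):
        x = run_block_ckpt(all_layer_params[start:start + block_size], x)
    return x


if __name__ == "__main__":
    key = jax.random.PRNGKey(0)
    num_layers, d = 16, 8
    params = [jax.random.normal(jax.random.fold_in(key, i), (d, d))
              for i in range(num_layers)]
    x = jnp.ones((4, d))
    block_size = int(math.isqrt(num_layers))  # ~ sqrt(num_layers)
    loss, grads = jax.value_and_grad(lambda p: forward(p, x, block_size).sum())(params)
    print(loss)
```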
What is this knob doing?
I couldn't find where this is used. It exists in many config files:
levanter/src/levanter/models/gpt2.py, line 63 in 5fb6767
It would be nice to have documentation about this.