The default model config is now narrower, and the model trains stably. Key changes: add an epsilon inside the sqrt to avoid NaN gradients, and adjust the range of
forget_base to be closer to the paper.
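
Why the epsilon helps, as a minimal sketch: the expression actually changed is not shown in this message, so the example assumes a hypothetical f(x) = sqrt(x*x + eps) and applies the chain rule by hand in plain Python. The helper names are illustrative only and are not from this repository.

```python
import math

def sqrt_local_grad(u):
    # d/du sqrt(u). At u == 0 the derivative is infinite, which is what
    # floating-point autodiff produces there as well.
    return math.inf if u == 0.0 else 0.5 / math.sqrt(u)

def grad_sqrt_of_square(x, eps=0.0):
    # Hypothetical example function f(x) = sqrt(x*x + eps).
    # Chain rule: f'(x) = sqrt'(u) * 2x, with u = x*x + eps.
    u = x * x + eps
    return sqrt_local_grad(u) * (2.0 * x)

print(grad_sqrt_of_square(0.0))            # inf * 0 -> nan
print(grad_sqrt_of_square(0.0, eps=1e-6))  # finite * 0 -> 0.0
```

Without the epsilon, the infinite local gradient of sqrt at zero is multiplied by a zero from the inner term, and inf * 0 is NaN. With the epsilon inside the sqrt, the local gradient stays finite and the product is well defined.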