Greetings,

Would it be possible to get a confirmation of the `emb_dim` value used to train the BERT model in the original XLM paper? I am trying to measure its effect on accuracy, GPU memory, and training time, but with the value of 2048 suggested in the README, training stops improving after a few epochs (with 512 and 1024 it keeps improving without issue).

For reference, section 5.1 (Training details) of the paper says "we use a Transformer architecture with 1024 hidden units", whereas both the README and issue #112 suggest 2048.
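For context, this is roughly the command I am running, adapted from the MLM training example in the README. The paths and the hyper-parameters other than `--emb_dim` are placeholders for my own setup, not a claim about the exact configuration used in the paper; only `--emb_dim` is changed between runs (512 / 1024 / 2048):

```bash
# Sketch of my MLM training run, adapted from the README's example.
# Paths and most hyper-parameters are placeholders for my setup;
# only --emb_dim is varied between runs (512 / 1024 / 2048).
python train.py \
  --exp_name mlm_en \
  --dump_path ./dumped \
  --data_path ./data/processed/en \
  --lgs 'en' \
  --clm_steps '' \
  --mlm_steps 'en' \
  --emb_dim 2048 \
  --n_layers 12 \
  --n_heads 16 \
  --dropout 0.1 \
  --attention_dropout 0.1 \
  --gelu_activation true \
  --batch_size 32 \
  --bptt 256 \
  --optimizer adam,lr=0.0001 \
  --epoch_size 300000 \
  --validation_metrics _valid_en_mlm_ppl \
  --stopping_criterion _valid_en_mlm_ppl,25
```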
Thanks,
Alfredo