backpropgation on chunks? #14

vr25 · 2020-10-02T15:12:24Z

Hi,

When the document chunks are fed to the data parallel model, how is the loss backpropagated? Is it for every chunk?

Also, do you unfreeze and fine-tune for the classification task?

Thank you!

AndriyMulyar · 2020-10-05T01:01:19Z

1. Yes, separate for every chunk. 2. In our datasets we found it sufficient to fine-tune only the final transformer layer.

…

On Fri, Oct 2, 2020, 11:12 AM Vipula Rawte ***@***.***> wrote: Hi, When the document chunks are fed to the data parallel model, how is the loss backpropagated? Is it for every chunk? Also, do you unfreeze and fine-tune for the classification task? Thank you! — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#14>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADJ4TBT57F53RQMV4ACPIJTSIXUWRANCNFSM4SB2AJKA> .

vr25 · 2020-10-05T02:44:10Z

More explanation on how loss is calculated for every chunk separately? I mean the entire document has a target label and so AFAIU, the loss would be calculated for this target, right? Please let me know if I am missing something.

Also, what is the maximum number of chunks in the entire dataset?

The default config has bert_batch_size=7 but I have some documents with a total number of chunks=125 per document. In such cases, if I set bert_batch_size to 125, I run into CUDA OOM error.

Any suggestions for this?

Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

backpropgation on chunks? #14

backpropgation on chunks? #14

vr25 commented Oct 2, 2020

AndriyMulyar commented Oct 5, 2020 via email

vr25 commented Oct 5, 2020 •

edited

Loading

backpropgation on chunks? #14

backpropgation on chunks? #14

Comments

vr25 commented Oct 2, 2020

AndriyMulyar commented Oct 5, 2020 via email

vr25 commented Oct 5, 2020 • edited Loading

vr25 commented Oct 5, 2020 •

edited

Loading