Right now, for simplicity, during distillation we pad every document to the maximum length so we can easily stack them to compute the scores.
An optimization would be to pad them only to the longest document in the column (since we are processing the documents column by column) and add the extra padding afterwards.
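A minimal sketch of the two strategies (the function names, `pad_id`, and the use of plain PyTorch tensors are assumptions for illustration, not the actual collator code):

```python
import torch

def stack_column_static(docs: list[torch.Tensor], max_doc_length: int, pad_id: int = 0) -> torch.Tensor:
    """Current behaviour: pad every document to max_doc_length before stacking."""
    batch = torch.full((len(docs), max_doc_length), pad_id, dtype=torch.long)
    for i, doc in enumerate(docs):
        batch[i, : doc.size(0)] = doc
    return batch

def stack_column_dynamic(docs: list[torch.Tensor], pad_id: int = 0) -> torch.Tensor:
    """Proposed optimization: pad only to the longest document in this column."""
    longest = max(doc.size(0) for doc in docs)
    batch = torch.full((len(docs), longest), pad_id, dtype=torch.long)
    for i, doc in enumerate(docs):
        batch[i, : doc.size(0)] = doc
    return batch
```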
Update: we are now working at the batch level, so we only need to pad to the longest document in the whole batch (instead of to max_doc_length). This can be achieved by simply setting pad_document to false in the tokenize function defined in the collator.
However, I am leaving it set to true as the default for now, since I surprisingly observed more VRAM usage when setting it to false (and I also need to benchmark its performance impact w.r.t. the .compile function).
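For reference, a hedged sketch of what the `pad_document` switch could look like inside the collator's tokenize function, assuming a Hugging Face tokenizer (the model name and defaults are placeholders, not the project's actual values):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model

def tokenize(texts: list[str], pad_document: bool = True, max_doc_length: int = 512):
    if pad_document:
        # Static shapes: every batch comes out as (batch_size, max_doc_length).
        return tokenizer(texts, padding="max_length", max_length=max_doc_length,
                         truncation=True, return_tensors="pt")
    # Dynamic shapes: pad only to the longest document in this batch.
    return tokenizer(texts, padding="longest", max_length=max_doc_length,
                     truncation=True, return_tensors="pt")
```

One plausible (unconfirmed) explanation for the extra VRAM with `pad_document=false`: variable batch shapes can trigger retracing under compilation and fragment the CUDA caching allocator, whereas static max_length padding keeps a single shape throughout training.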