In this bit of code, we use `islice` to read `batch_size` elements from the dataset if the dataset has not yet read past its epoch (for this stage): OpusTrainer/src/opustrainer/trainer.py, lines 629 to 635 at b6355ae.

The problem is that if `batch_size` is anywhere near the size of the dataset, we could well read beyond that epoch, and the output becomes larger than you'd expect. If no errors occur in the modifiers, the output will always be a multiple of `batch_size`.

One way to solve this is to put the epoch tracker around the iterator directly, so it stops generating once it has reached its epoch limit. That batch might then be smaller, but that's okay.
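A minimal sketch of that idea (the names `EpochCappedIterator` and `read_batch` below are mine for illustration, not OpusTrainer's actual classes): if the epoch tracking lives inside the iterator, `islice` can never read past the epoch boundary, and the batch that straddles the boundary simply comes out short. In a real trainer the counter would reset (or the dataset reshuffle) at the boundary; this only shows the capping behaviour.

```python
from itertools import islice
from typing import Iterable, Iterator


class EpochCappedIterator:
    """Hypothetical wrapper (not OpusTrainer's actual epoch tracker): stops
    yielding once epoch_size items have been read, so a caller using
    islice() can never read past the epoch boundary."""

    def __init__(self, dataset: Iterable[str], epoch_size: int):
        self._it = iter(dataset)
        self._epoch_size = epoch_size
        self._read = 0

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        if self._read >= self._epoch_size:
            # Epoch limit reached mid-batch: stop here. The batch currently
            # being filled just ends up smaller than batch_size.
            raise StopIteration
        self._read += 1
        return next(self._it)


def read_batch(it: EpochCappedIterator, batch_size: int) -> list[str]:
    # islice() asks for up to batch_size items, but the wrapper cuts it off
    # at the epoch boundary, so the result can be shorter (or empty).
    return list(islice(it, batch_size))


lines = [f"line {i}" for i in range(10)]          # a tiny 10-line "dataset"
it = EpochCappedIterator(lines, epoch_size=10)
print(len(read_batch(it, 8)))   # 8 -- a normal batch
print(len(read_batch(it, 8)))   # 2 -- cut short at the epoch boundary
```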
Another thing I'd like to take a look at is that we generate `batch_size * dataset_weight` items per dataset. If `batch_size` is small, this doesn't work very well.
I also saw `int(batch_size * weight)` and came to open an issue. At least one test calls for a batch size of 1. Any dataset with a weight $w < 1$ then contributes nothing to the batch. The only reason this test passes is that it uses a single dataset with a weight of 1. Reducing that sole weight below 1 causes the test to hang, because the dataset never advances and therefore never triggers its `until` epoch condition.

In a more real-world scenario with multiple datasets, we may miss this if the dataset is never part of an `until` condition, because we do not impose any checks on batch size.

Since this is ultimately a sampling issue, I'll also add that we should be normalising weights where possible. That would allow dataset ratios like 2:1 to behave properly, and it would also make our sampling more robust.
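To make the rounding problem concrete: with `batch_size = 1` and a weight of 0.5, `int(1 * 0.5)` is 0, so that dataset contributes nothing and never advances towards any `until` condition. A rough sketch of what normalised, remainder-aware allocation could look like (my own illustration with a hypothetical `allocate_batch` helper, not a patch against trainer.py):

```python
def allocate_batch(batch_size: int, weights: dict[str, float]) -> dict[str, int]:
    """Split batch_size items over datasets in proportion to their weights.

    Weights are normalised first, so ratios such as 2:1 behave the same as
    0.667:0.333. Integer counts are then assigned by largest remainder
    instead of plain int() truncation, which would drop any dataset whose
    share of the batch falls below one item (e.g. batch_size=1, weight=0.5).
    """
    total = sum(weights.values())
    normalised = {name: w / total for name, w in weights.items()}

    # Ideal (fractional) share of the batch per dataset.
    shares = {name: batch_size * w for name, w in normalised.items()}
    counts = {name: int(share) for name, share in shares.items()}

    # Hand the remaining slots to the datasets with the largest remainders.
    remaining = batch_size - sum(counts.values())
    by_remainder = sorted(shares, key=lambda name: shares[name] - counts[name],
                          reverse=True)
    for name in by_remainder[:remaining]:
        counts[name] += 1
    return counts


# With int() truncation alone, a batch of 1 over two weight-0.5 datasets
# would read 0 items from both; here one of them gets the slot.
print(allocate_batch(1, {'a': 0.5, 'b': 0.5}))   # {'a': 1, 'b': 0}
print(allocate_batch(10, {'a': 2.0, 'b': 1.0}))  # {'a': 7, 'b': 3}
```

With normalised weights, specifying datasets as 2:1 or as 0.667:0.333 gives the same split, and no dataset silently rounds down to zero items per batch.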