An extension of PyTorch's IterableDataset, this package adds functionality for shuffling, limiting, and offsetting data.
Directly from PyPI:
pip install torch-exid
Or using Poetry:
poetry add torch-exid
Begin by subclassing ExtendedIterableDataset and implementing the generator method to yield items.
Here's a simple example using an IntegersDataset:
from typing import Iterator

from torch_exid import ExtendedIterableDataset

class IntegersDataset(ExtendedIterableDataset):
    def generator(self) -> Iterator[int]:
        n = 0
        while True:
            yield n
            n += 1

# Will print out integers 0, 1, ..., 9:
for n in IntegersDataset(limit=10):
    print(n)
ExtendedIterableDataset introduces several parameters to provide additional control:
limit: Sets the maximum number of data points to return. If negative, all data points are returned. Default is -1 (return all data).
# Will print out "0, 1, 2"
for n in IntegersDataset(limit=3):
    print(n)
offset: Determines the number of initial data points to skip. Default is 0.
# Will print out "2, 3, 4"
for n in IntegersDataset(limit=3, offset=2):
    print(n)
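How limit and offset combine can be pictured with the standard library alone. The sketch below is not the package's implementation; it assumes, consistent with the example above, that offset items are skipped first and limit then caps what is returned. The helper name limit_and_offset is hypothetical.

```python
from itertools import count, islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def limit_and_offset(items: Iterable[T], limit: int = -1, offset: int = 0) -> Iterator[T]:
    """Skip `offset` items, then yield at most `limit` items (all of them if negative)."""
    # A negative limit means "no upper bound", which islice expresses as None.
    stop = None if limit < 0 else offset + limit
    return islice(items, offset, stop)


# Works on an infinite stream because islice is lazy.
print(list(limit_and_offset(count(), limit=3, offset=2)))  # [2, 3, 4]
```

Because the slice is lazy, nothing beyond the last requested item is pulled from the source.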
shuffle_buffer: Specifies the buffer size for shuffling. If greater than 1, data is buffered and shuffled prior to being returned. If set to 1 (default), no shuffling occurs.
# Might print out "0, 1, 3, 2" the first time...
for n in IntegersDataset(limit=4, shuffle_buffer=2):
    print(n)
# ...and "1, 0, 2, 3" the second time
for n in IntegersDataset(limit=4, shuffle_buffer=2):
    print(n)
shuffle_seed: Defines the seed for the random number generator used in shuffling. If not provided, a random seed is used:
# Will print out "1, 0, 3, 2" both times:
for n in IntegersDataset(limit=4, shuffle_buffer=2, shuffle_seed=42):
    print(n)
for n in IntegersDataset(limit=4, shuffle_buffer=2, shuffle_seed=42):
    print(n)
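To make the idea of a seeded shuffle buffer concrete, here is a minimal pure-Python sketch. It assumes a chunk-wise scheme (fill the buffer, shuffle it, emit it, repeat), which is consistent with the outputs shown above but may differ from what the package actually does. The function name buffered_shuffle is hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    items: Iterable[T], buffer_size: int = 1, seed: Optional[int] = None
) -> Iterator[T]:
    """Shuffle a stream in chunks of `buffer_size` using a private, seedable RNG."""
    rng = random.Random(seed)  # seed=None falls back to a random seed
    buffer: List[T] = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Buffer is full: shuffle it and emit everything it holds.
            rng.shuffle(buffer)
            yield from buffer
            buffer.clear()
    # Source exhausted: emit whatever is left in a partial buffer.
    rng.shuffle(buffer)
    yield from buffer


first = list(buffered_shuffle(range(6), buffer_size=2, seed=42))
second = list(buffered_shuffle(range(6), buffer_size=2, seed=42))
print(first == second)  # True: the same seed reproduces the same order
```

With this scheme an item never moves further than its own chunk, so a larger buffer gives a more thorough shuffle at the cost of more memory.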
transforms: A list of transformations to apply to each data point, in the order given. Default is an empty list.
ds = IntegersDataset(
    limit=3,
    transforms=[
        lambda n: n + 1,
        lambda n: n ** 2,
    ],
)
# Will print out "1, 4, 9"
for n in ds:
    print(n)
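The chaining of transforms can be illustrated without the package. The sketch below assumes each transform receives the output of the previous one, as the "1, 4, 9" result above suggests; apply_transforms is a hypothetical helper, not part of torch-exid.

```python
from functools import reduce
from typing import Any, Callable, Iterable, Iterator, Sequence


def apply_transforms(
    items: Iterable[Any], transforms: Sequence[Callable[[Any], Any]]
) -> Iterator[Any]:
    """Pass every item through each transform in turn, left to right."""
    for item in items:
        # reduce feeds the running value into the next function in the list.
        yield reduce(lambda value, fn: fn(value), transforms, item)


squares_of_successors = apply_transforms(range(3), [lambda n: n + 1, lambda n: n ** 2])
print(list(squares_of_successors))  # [1, 4, 9]
```

An empty list leaves the items unchanged, which matches the documented default.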
In addition to the above, any arguments or keyword arguments for the IterableDataset superclass can also be passed.
skip_next: This method skips the next item that would be yielded by the generator. Using skip_next will not affect the limit or offset.
class EvensDataset(ExtendedIterableDataset):
    def generator(self) -> Iterator[int]:
        n = 0
        while True:
            if n % 2 != 0:
                self.skip_next()
            yield n
            n += 1

ds = EvensDataset(limit=5)
# Will print out "0, 2, 4, 6, 8"
for n in ds:
    print(n)
In other words, it allows you to bypass the next item without modifying the overall iteration parameters.
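One way to picture why skipped items do not count toward limit is a flag that the iteration loop checks before counting an item. The sketch below is an assumption about how such behaviour could be built, not the package's actual code; SkippableStream and Evens are hypothetical stand-ins that support only limit and skip_next.

```python
from typing import Iterator


class SkippableStream:
    """Hypothetical stand-in for the real base class."""

    def __init__(self, limit: int = -1) -> None:
        self.limit = limit
        self._skip = False

    def skip_next(self) -> None:
        # Mark the next item produced by generator() to be dropped.
        self._skip = True

    def generator(self) -> Iterator[int]:
        raise NotImplementedError

    def __iter__(self) -> Iterator[int]:
        if self.limit == 0:
            return
        returned = 0
        for item in self.generator():
            if self._skip:
                # Dropped items are discarded before they are counted.
                self._skip = False
                continue
            yield item
            returned += 1
            if returned == self.limit:
                return


class Evens(SkippableStream):
    def generator(self) -> Iterator[int]:
        n = 0
        while True:
            if n % 2 != 0:
                self.skip_next()
            yield n
            n += 1


print(list(Evens(limit=5)))  # [0, 2, 4, 6, 8]
```

The counter only advances for items that are actually handed to the caller, so five even numbers come back even though nine values were generated.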
Contributions are greatly appreciated! You can help by reporting issues, proposing new features, or submitting pull requests with bug fixes or new functionality.
Here are the steps to get started with development:
# Clone the repository:
git clone https://github.com/arlegotin/torch_exid.git
cd torch_exid
# Install the project and its dependencies using Poetry:
poetry install
# Spawn a shell within the virtual environment:
poetry shell
# Run tests to ensure everything is working correctly:
pytest tests/
Please ensure all changes are accompanied by relevant unit tests, and that all tests pass before submitting a pull request. This helps maintain the quality and reliability of the project.