presimple_toyzero in memory dataset #16
Add functions to detect multi-track windows
Extract script added
Hi @YHRen. I was wondering if

```python
class PrefetchingDecorator(GenericDataset):
    def __init__(self, dset):
        self._dset = dset
        self._shared_data = self._preload_data()

    ...

    def __getitem__(self, index):
        return [self._shared_data[index][0], self._shared_data[index][1]]
```

And use |
Also, FYI, we already have a dataset that is designed for superfast handling of crops -- |
Absolutely! Great idea. Feel free to change accordingly.
I haven't looked into this. Yi pointed me to the presimple dataset file she is using. After briefly looking into it, the loading seems similar, but it loads a 128x128 uncompressed file from the filesystem? In any case, let me know if this PR is erroneous. If you think it is OK, we should let Yi switch to this one to save some training time.
I do not see any obvious errors with this PR. Sure, Yi is welcome to use the new dataset.
Linux maintains a single cache per filesystem, so there won't be any cache duplication between processes. I am happily using it with 1:1 CPU:GPU ratio and the CPU is never a bottleneck.
Motivation:
Preload the entire dataset into memory and make it shareable among worker processes.
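The preloading idea can be illustrated with a minimal sketch. This is not the code from this PR: `CountingDataset` and `PreloadedDataset` are hypothetical names, and the sketch covers only the "load once, serve from memory" part, not how the data is shared among worker processes.

```python
class CountingDataset:
    """Hypothetical stand-in for a dataset whose __getitem__ is expensive."""

    def __init__(self, n_items):
        self._n_items = n_items
        self.load_calls = 0  # how many times an item was "loaded from disk"

    def __len__(self):
        return self._n_items

    def __getitem__(self, index):
        if not 0 <= index < self._n_items:
            raise IndexError(index)
        self.load_calls += 1
        # Pretend this is an unzip + allocation; return an (input, target) pair.
        return (index, index * index)


class PreloadedDataset:
    """Hypothetical wrapper: loads every item once, then serves from memory."""

    def __init__(self, dset):
        self._data = [dset[i] for i in range(len(dset))]

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]


base = CountingDataset(4)
cached = PreloadedDataset(base)

# Repeated access (e.g. over several epochs) no longer touches the base dataset.
for _ in range(3):
    items = [cached[i] for i in range(len(cached))]

print(base.load_calls)  # 4
print(items[3])         # (3, 9)
```

The cost of the expensive load is paid once per item at construction time, in exchange for holding the whole dataset in memory.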
Currently, loading each `npz` file (link) involves unzipping and memory allocation (~5 MiB). We should do this once and store the resulting 128x128 array (64 KiB) in memory. This will be beneficial for training and testing on small datasets (~64 GB for 1M samples).

Here is a brief test and profiling.
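The size figures quoted in the motivation can be checked with a back-of-the-envelope calculation. The sketch assumes each 128x128 crop is stored as float32 (4 bytes per element), which is what the 64 KiB figure implies; the helper names are hypothetical.

```python
# Rough memory estimate for an in-memory dataset of 128x128 crops.
# Assumption: float32 storage (4 bytes per element).

def crop_bytes(height=128, width=128, bytes_per_element=4):
    """Size of a single crop in bytes."""
    return height * width * bytes_per_element


def dataset_bytes(n_samples, **kwargs):
    """Size of n_samples crops in bytes, ignoring container overhead."""
    return n_samples * crop_bytes(**kwargs)


KIB = 1024
GIB = 1024 ** 3

per_crop = crop_bytes()
total = dataset_bytes(1_000_000)

print(per_crop / KIB)  # 64.0
print(total / GIB)     # 61.03515625
```

Under this assumption, one million crops need about 65.5 GB (roughly 61 GiB), in line with the estimate above.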
Test and Profiling
```python
from toytools.datasets import PreSimpleToyzeroDataset
```

`n100-U-128x128_sample_512.csv` is the top 512 examples in the 50k dataset.

P.S. I also removed the `__pycache__` folder.