
pytorch dataset: pass cache/prefetch to DataChain instances #653

Merged
merged 1 commit into main from prefetch-pytorch
Dec 4, 2024

Conversation

skshetry
Member

@skshetry skshetry commented Dec 2, 2024

This should enable prefetching in the to_pytorch API. We were not passing any settings to the DataChain instances created inside to_pytorch, so they were not using the cache or prefetch settings.

I have refrained from setting workers, etc., because I believe that is better left to PyTorch's DataLoader for now.

I have found this PR to increase performance for the `torch-loader.py` example by 20-25% when prefetch and cache are enabled. However, this was measured on my machine and is not a scientific benchmark.

Closes #631.
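
For context, a minimal sketch of the user-facing flow this affects (not code from this PR; it assumes the `DataChain.from_storage` / `settings()` / `to_pytorch()` API, and the bucket URI and DataLoader arguments are placeholders):

```python
# Hypothetical usage sketch, not code from this PR.
from torch.utils.data import DataLoader

from datachain import DataChain

chain = (
    DataChain.from_storage("gs://example-bucket/images/")  # placeholder URI
    # With this change, cache/prefetch set here are also passed to the
    # DataChain instances that to_pytorch() creates internally.
    .settings(cache=True, prefetch=10)
)

# to_pytorch() returns a torch-compatible iterable dataset streaming the files.
loader = DataLoader(chain.to_pytorch(), batch_size=16)
for batch in loader:
    ...  # training step
```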

@skshetry skshetry requested a review from a team December 2, 2024 13:05
@skshetry skshetry changed the title from "pytorch dataset: pass cache/prefetch to DataChain constructor" to "pytorch dataset: pass cache/prefetch to DataChain instances" Dec 2, 2024

codecov bot commented Dec 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.67%. Comparing base (692c8dc) to head (e258e9f).
Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #653   +/-   ##
=======================================
  Coverage   87.67%   87.67%           
=======================================
  Files         111      111           
  Lines       10601    10603    +2     
  Branches     1436     1436           
=======================================
+ Hits         9294     9296    +2     
  Misses        945      945           
  Partials      362      362           
| Flag | Coverage Δ |
| --- | --- |
| datachain | 87.61% <100.00%> (+<0.01%) ⬆️ |



@shcheklein
Member

@skshetry could you clarify the scope of #631 based on the discussion in our previous meetings? (Does this PR close it, for example, or did we have more in mind?) Do we need more testing or more research into how it works?

Contributor

@dreadatour dreadatour left a comment


Looks good to me! 👍

@skshetry
Member Author

skshetry commented Dec 4, 2024

> @skshetry could you clarify the scope of #631 based on the discussion in our previous meetings? (Does this PR close it, for example, or did we have more in mind?) Do we need more testing or more research into how it works?

I was considering whether we need prefetching at all, given that pytorch handles some of this internally. However, in the example I mentioned earlier, a significant amount of time is spent downloading files.

(Even with prefetch_factor set, files are still downloaded sequentially; downloading only happens as the iterator is consumed.)

When running with cache=True, I see that the DataLoader processes the dataset (200 files) in under a second, compared to around 30 seconds without caching. So the potential is there for improvement. That said, I haven’t noticed any meaningful improvements with prefetch alone so far.

I’m currently experimenting with a dataset containing a larger number of files to test the hypothesis that prefetching improves performance.

While prefetching should theoretically enhance DataLoader performance, I haven’t yet observed any tangible gains (see related issue: #635).


prefetch + cache does help on successive invocations thanks to caching (which is how prefetch is implemented at the moment anyway). So I think we should pass these settings through to to_pytorch and investigate the performance issue separately.
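
For reference, a sketch of the distinction being drawn here (an assumption-laden illustration, not code from this PR; `chain` is the chain built as in the description above, and the argument values are arbitrary): `prefetch_factor` only makes each DataLoader worker prepare batches ahead of the training loop, whereas DataChain's `prefetch`/`cache` settings deal with downloading the underlying files.

```python
from torch.utils.data import DataLoader

# torch-side prefetching: each worker keeps a few batches ready ahead of
# consumption, but files are still fetched one by one inside each worker.
loader = DataLoader(
    chain.to_pytorch(),   # chain with .settings(cache=True, prefetch=10), as above
    batch_size=16,
    num_workers=2,        # prefetch_factor requires num_workers > 0
    prefetch_factor=4,    # number of batches buffered per worker
)
```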

@skshetry skshetry merged commit 325b7b3 into main Dec 4, 2024
38 checks passed
@skshetry skshetry deleted the prefetch-pytorch branch December 4, 2024 16:01