
pytorch dataset: pass cache/prefetch to DataChain instances #653

Merged
merged 1 commit into main from prefetch-pytorch
Dec 4, 2024

Conversation

skshetry
Member

@skshetry skshetry commented Dec 2, 2024

This should enable prefetching in the to_pytorch API. We were not passing any settings to the DataChain instances created inside to_pytorch, so they were not using the cache or prefetch settings.

I have refrained from setting workers, etc., because I believe that is better left to PyTorch's DataLoader for now.

I have found this PR to increase performance for the `torch-loader.py` example by 20-25% when prefetch and cache are enabled. However, this was measured on my machine and is not a scientific benchmark.

Closes #631.
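
For context, a minimal sketch of the user-facing flow this affects (not code from this PR; it assumes the `DataChain.from_storage` / `settings()` / `to_pytorch()` API, and the bucket URI and DataLoader arguments are placeholders):

```python
# Hypothetical usage sketch, not code from this PR.
from torch.utils.data import DataLoader

from datachain import DataChain

chain = (
    DataChain.from_storage("gs://example-bucket/images/")  # placeholder URI
    # With this change, cache/prefetch set here are also passed to the
    # DataChain instances that to_pytorch() creates internally.
    .settings(cache=True, prefetch=10)
)

# to_pytorch() returns a torch-compatible iterable dataset streaming the files.
loader = DataLoader(chain.to_pytorch(), batch_size=16)
for batch in loader:
    ...  # training step
```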

@skshetry skshetry requested a review from a team December 2, 2024 13:05
@skshetry skshetry changed the title from "pytorch dataset: pass cache/prefetch to DataChain constructor" to "pytorch dataset: pass cache/prefetch to DataChain instances" Dec 2, 2024

codecov bot commented Dec 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.67%. Comparing base (692c8dc) to head (e258e9f).
Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #653   +/-   ##
=======================================
  Coverage   87.67%   87.67%           
=======================================
  Files         111      111           
  Lines       10601    10603    +2     
  Branches     1436     1436           
=======================================
+ Hits         9294     9296    +2     
  Misses        945      945           
  Partials      362      362           
| Flag | Coverage Δ |
| --- | --- |
| datachain | 87.61% <100.00%> (+<0.01%) ⬆️ |



@shcheklein
Member

@skshetry could you clarify the scope of #631 based on the discussion in our previous meetings? (Does this PR close it, for example, or did we have more in mind?) Do we need more testing or more research into how it works?

Contributor

@dreadatour dreadatour left a comment


Looks good to me! 👍

@skshetry
Member Author

skshetry commented Dec 4, 2024

> @skshetry could you clarify the scope of #631 based on the discussion in our previous meetings? (Does this PR close it, for example, or did we have more in mind?) Do we need more testing or more research into how it works?

I was considering whether we need prefetching at all, given that pytorch handles some of this internally. However, in the example I mentioned earlier, a significant amount of time is spent downloading files.

(Even with prefetch_factor set, files are still downloaded sequentially; downloading only happens as the iterator is consumed.)

When running with cache=True, I see that the DataLoader processes the dataset (200 files) in under a second, compared to around 30 seconds without caching. So the potential is there for improvement. That said, I haven’t noticed any meaningful improvements with prefetch alone so far.

I’m currently experimenting with a dataset containing a larger number of files to test the hypothesis that prefetching improves performance.

While prefetching should theoretically enhance DataLoader performance, I haven’t yet observed any tangible gains (see related issue: #635).


prefetch + cache does help on successive invocations thanks to caching (which is how prefetch is implemented at the moment anyway). So I think we should pass these settings through to to_pytorch and investigate the performance issue separately.
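
For reference, a sketch of the distinction being drawn here (an assumption-laden illustration, not code from this PR; `chain` is the chain built as in the description above, and the argument values are arbitrary): `prefetch_factor` only makes each DataLoader worker prepare batches ahead of the training loop, whereas DataChain's `prefetch`/`cache` settings deal with downloading the underlying files.

```python
from torch.utils.data import DataLoader

# torch-side prefetching: each worker keeps a few batches ready ahead of
# consumption, but files are still fetched one by one inside each worker.
loader = DataLoader(
    chain.to_pytorch(),   # chain with .settings(cache=True, prefetch=10), as above
    batch_size=16,
    num_workers=2,        # prefetch_factor requires num_workers > 0
    prefetch_factor=4,    # number of batches buffered per worker
)
```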

@skshetry skshetry merged commit 325b7b3 into main Dec 4, 2024
38 checks passed
@skshetry skshetry deleted the prefetch-pytorch branch December 4, 2024 16:01