Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for finer-grained control of concurrency in file transfers #750

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

moradology
Copy link
Contributor

@moradology moradology commented May 30, 2024

This PR enables transfer of files with two knobs to manage concurrency.
max_executors tells the transform how many groups of URLs to create (which should generally set an upper bound on the number of required worker nodes in the cluster)
concurrency_per_executor tells the system how many URLs per group can be concurrently opened for file transfer at maximum.

Total concurrency across the cluster will, of course, be a function of these two values with a theoretical maximum of max_executors * concurrency_per_executor

@jbusecke
Copy link
Contributor

jbusecke commented Jun 3, 2024

Thanks so much for this @moradology. Testing this currently in leap-stc/climsim_feedstock#7

@jbusecke
Copy link
Contributor

jbusecke commented Jun 4, 2024

While I am testing this I wanted to start a discussion about how this would be used/documented (and furthermore how/if injection would work here).

So my main question is: Once this works, should we entirely replace the caching option in OpenWithFSSpec? I wonder if there is any advantage in using that instead of this stage? Maybe somebody here has a usecase that would break? From my POV it is much more explicit and easy to understand with a separate stage. If we decide to go this route I would opt for a more intuitive name though. How about simply 'CacheFiles'?

Whether we go ahead with this as an optional stage or a replacement for the cache=... option in OpenWithFSSpec the way this is currently set up we are kind of breaking the paradigm of 'the recipe should not contain any information about the storage' - which is usually passed as config to runner.
Can we have a chat if we want to adapt the injection logic, so that I can define my cache target in the config like before (In my testing I needed to move the specification of the cache target to the recipe see leap-stc/climsim_feedstock#7).

@jbusecke
Copy link
Contributor

jbusecke commented Jun 4, 2024

Also testing this over at leap-stc/cmip6-leap-feedstock#170

@moradology
Copy link
Contributor Author

The only reason I can see to keep caching around in OpenURLWithFSSpec is to maintain API compatibility - and then, only for a limited time. I personally favor being explicit about things like this to avoid painful contortions when expectations/needs change in ways we don't anticipate.

As for recipes not containing information about the specific storage location, I think your intuition is correct. There is some discussion here which may be of interest. I'm not sure there's strong consensus about how to approach things though, for my part, I'd prefer to see the dependency injection either abandoned (and people just fork/update recipes as needed) or facilitated via the (for my mind, at least) easier to reason about mechanism of python functions

@moradology moradology force-pushed the feature/concurrency-control branch from 54a0b10 to 487f039 Compare June 4, 2024 19:21
@jbusecke
Copy link
Contributor

jbusecke commented Jun 5, 2024

The only reason I can see to keep caching around in OpenURLWithFSSpec is to maintain API compatibility - and then, only for a limited time. I personally favor being explicit about things like this to avoid painful contortions when expectations/needs change in ways we don't anticipate.

That seems like a great way forward. Maybe start a deprecation cycle by implementing a warning when cache is not None in OpenWithFsspec?

As for recipes not containing information about the specific storage location, I think your intuition is correct. There is some discussion here which may be of interest. I'm not sure there's strong consensus about how to approach things though, for my part, I'd prefer to see the dependency injection either abandoned (and people just fork/update recipes as needed) or facilitated via the (for my mind, at least) easier to reason about mechanism of python functions

Curious to discuss this further in the coming weeks. I have to admint that I do not 100% understand that issue, but curious to learn more.

@jbusecke
Copy link
Contributor

jbusecke commented Jun 5, 2024

Here is a suggestion to further improve the performance of this stage for recipes with a ton of files. Curious to hear if this would add too much complexity.

  • Can we decouple the 'check cache' and actual 'transfer/download to cache'?

The motivation here is that for the transfer we are limited by external factors (when does the server think we are DDOSing it), but for checking if files exist we can probably have a much higher concurency.

So currently we are doing (in an abstract way) this in every executor:

cached_files = 
#limit to external concurrency
for file in file_group:
   cached_url = check_and_cache(file)
   cached_files.append(cached_url)

Could we modify it to something like:

already_cached_files = []
needs_caching_files = []

# much higher concurrency
for file in file_group:
    status = check(file)
    if status='cached':
       already_cached_files.append(file)
    elif status='not_cached':
        needs_caching_files.append(file)

# limit to external concurrency
cached_files = []
for file in needs_caching_files:
    cached_url = cache(file)
   cached_files.append(cached_url)

return already_cached_files + cached_files # this might mess up order, not sure how we would handle that.

@moradology
Copy link
Contributor Author

moradology commented Jun 10, 2024

I'm thinking a simple transform to filter down already-transferred URLs is sensible ahead of more costly file transfers.

Here's a flag that should pre-filter as desired: https://github.com/pangeo-forge/pangeo-forge-recipes/pull/750/files#diff-8bac120398898793cd4f9daf94551b1f3d3f1867bed8a68b14cceed49d6dc30fR220

@moradology moradology force-pushed the feature/concurrency-control branch from 91a217e to a851610 Compare June 10, 2024 17:09
@moradology moradology force-pushed the feature/concurrency-control branch from 59c4579 to 1c9321a Compare June 12, 2024 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants