Improve advanced KubeCluster customization API #899

jacobtomlinson · 2024-07-23T13:14:20Z

Background

When creating a new KubeCluster the __init__()/_start() process happens in a couple of steps.

Generate a dictionary of the Kubernetes spec from provided kwargs using make_cluster_spec()
Create a DaskCluster class (which was created by kr8s.asyncio.objects.make_class()) from that dictionary
Call DaskCluster.create()

If users want to customize their cluster spec before creating it they can use make_cluster_spec() directly which takes the same kwargs as KubeCluster and returns the dictionary that is generated in step 1, then they can modify the dict before passing it directly to step 2 via the kwarg KubeCluster(custom_cluster_spec=...) .

Problem statement

When it comes to adding new kwargs to make_cluster_spec() we are very conservative because it is easy for that method to become extremely overwhelming. However as highlighted in #898 there may be customizations we want to do regularly like setting the memory limit on the dask worker CLI args which is not very ergonomic to do by modifying the dict directly. It would be nice of the object returned from make_cluster_spec() was easier to customize than a simple dictionary.

For example

from dask_kubernetes.operator import KubeCluster, make_cluster_spec

cluster_spec = make_cluster_spec("foo", ...)

# Currently we do this
cluster_spec["spec"]["worker"]["spec"]["containers"][0]["args"] += ["--memory-limit", "4GB"]
# It would be nicer if we could do something like this instead
# cluster_spec.set_worker_memory_limit("4GB")

cluster = KubeCluster(custom_cluster_spec=cluster_spec)

Option 1

Instead of returning a dict from make_cluster_spec() we could return the DaskCluster object from step 2. This object is dict-like so any existing code that modifies the dict should work as expected (although this will need to be verified).

Pros

The existing DaskCluster class is a convenient place to add methods to simplify some of the common customizations. It will be created anyway, we would just be changing the step at which the user can intervene from 1 to 2.

Cons

The DaskCluster object that would be returned by make_cluster_spec() would be an async kr8s object and potentially KubeCluster is being used in a sync context, and we would then be extending it with sync methods to modify things.

Returning the DaskCluster object leaks the implementation detail that we are using kr8s to interact with Kubernetes.

It also means in theory a user could call await cluster_spec.create() which would create the Dask cluster, but it wouldn't handle any of the port forwarding, log retrieval, etc and couldn't be passed to dask.distributed.Client.

Having KubeCluster and DaskCluster, which serve different purposes but have similar names, may cause confusion with users.

Option 2

We could introduce some new intermediate object with a name like KubeClusterSpec which would effectively be a subclass of dict but with whatever convenience methods we want. We would return this from make_cluster_spec().

We would need to ensure that DaskCluster can be passed a KubeClusterSpec instead of a dict. There is already support in kr8s for passing a few different things, so we should be able to just conform to one of those.

Pros

Adding a new class would avoid any user confusion as users would continue to never interact with kr8s resources directly.

The purpose of the new class is well scoped, it is a dictionary of the spec with some utility methods to simplify modifying the spec.

Cons

Introducing a new class into the chain increases complexity.

The text was updated successfully, but these errors were encountered:

droctothorpe · 2024-07-23T15:09:25Z

As a single data point, I prefer option 2. Internally, we have a DaskClusterConfig class with a spec method that just returns a dictionary in the format that custom_cluster_spec expects.

This is nice because it lets you configure your cluster in an object oriented way instead of manipulating a complex, nested dictionary. It provides sensible defaults but you can easily override them. Everything is a property so you can instantiate first then set specific properties later. It pretty prints your config and even creates an interactive ipywidget when executed within a Jupyter notebook. It was inspired by Dask Gateway's Options class.

jacobtomlinson · 2024-07-23T16:58:54Z

Yeah I think I'm leaning that way too.

jacobtomlinson added enhancement idea labels Jul 23, 2024

jacobtomlinson mentioned this issue Jul 24, 2024

xgboost "Allreduce failed" with the dask operator specifically #898

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve advanced KubeCluster customization API #899

Improve advanced KubeCluster customization API #899

jacobtomlinson commented Jul 23, 2024 •

edited

Loading

droctothorpe commented Jul 23, 2024 •

edited

Loading

jacobtomlinson commented Jul 23, 2024

Improve advanced KubeCluster customization API #899

Improve advanced KubeCluster customization API #899

Comments

jacobtomlinson commented Jul 23, 2024 • edited Loading

Background

Problem statement

Option 1

Pros

Cons

Option 2

Pros

Cons

droctothorpe commented Jul 23, 2024 • edited Loading

jacobtomlinson commented Jul 23, 2024

jacobtomlinson commented Jul 23, 2024 •

edited

Loading

droctothorpe commented Jul 23, 2024 •

edited

Loading