
For some reason, Dask clusters can be created in one namespace but not in another, despite the same setup #903

Open
capoolebugchat opened this issue Aug 23, 2024 · 2 comments
Labels
bug needs info Needs further information from the user

Comments

@capoolebugchat

Hi gang. I just managed to install the Dask Operator on a K8s cluster along with the necessary Dask CRDs. To try things out, I stood up my own JupyterLab server in a namespace, set up a service account for it, made sure it worked, and finally installed Dask and dask-kubernetes.
I tried creating a toy Dask cluster and it worked.
I then replicated the whole YAML setup, changing only the namespace, in order to access some namespace-bound resources. In the new namespace the scheduler pod spawns, but when it comes to scaling the workers, the Jupyter output shows a really strange error (below).
I've exhausted my Google searches and nothing came up. Hope you can point me in the right direction.
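
For reference, a minimal sketch of the failing cell (the cluster name, namespace, and arguments are taken from the traceback below; the import paths are my best guess inferred from the file paths in the traceback):

from dask_kubernetes.operator import KubeCluster
from dask_kubernetes.operator.kubecluster.kubecluster import CreateMode  # import path inferred from the traceback

# Create only the DaskCluster resource in the target namespace;
# the scheduler pod comes up fine at this point.
cluster = KubeCluster(
    name="my-dask-cluster",
    namespace="dask-jobs-ns",
    create_mode=CreateMode.CREATE_ONLY,
)

# Scaling the default worker group is where the NotFoundError below is raised.
cluster.scale(2)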

NotFoundError                             Traceback (most recent call last)
Cell In[85], line 5
      3 # KubeCluster?
      4 cluster = KubeCluster(name="my-dask-cluster", namespace="dask-jobs-ns", create_mode=CreateMode.CREATE_ONLY)
----> 5 cluster.scale(2)

File /opt/conda/lib/python3.10/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py:729, in KubeCluster.scale(self, n, worker_group)
    713 def scale(self, n, worker_group="default"):
    714     """Scale cluster to n workers
    715 
    716     Parameters
   (...)
    726     >>> cluster.scale(7, worker_group="high-mem-workers") # scale worker group high-mem-workers to seven workers
    727     """
--> 729     return self.sync(self._scale, n, worker_group)

File /opt/conda/lib/python3.10/site-packages/distributed/utils.py:358, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    356     return future
    357 else:
--> 358     return sync(
    359         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    360     )

File /opt/conda/lib/python3.10/site-packages/distributed/utils.py:434, in sync(loop, func, callback_timeout, *args, **kwargs)
    431         wait(10)
    433 if error is not None:
--> 434     raise error
    435 else:
    436     return result

File /opt/conda/lib/python3.10/site-packages/distributed/utils.py:408, in sync.<locals>.f()
    406         awaitable = wait_for(awaitable, timeout)
    407     future = asyncio.ensure_future(awaitable)
--> 408     result = yield future
    409 except Exception as exception:
    410     error = exception

File /opt/conda/lib/python3.10/site-packages/tornado/gen.py:767, in Runner.run(self)
    765 try:
    766     try:
--> 767         value = future.result()
    768     except Exception as e:
    769         # Save the exception for later. It's important that
    770         # gen.throw() not be called inside this try/except block
    771         # because that makes sys.exc_info behave unexpectedly.
    772         exc: Optional[Exception] = e

File /opt/conda/lib/python3.10/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py:740, in KubeCluster._scale(self, n, worker_group)
    735     await autoscaler.delete()
    737 wg = await DaskWorkerGroup(
    738     f"{self.name}-{worker_group}", namespace=self.namespace
    739 )
--> 740 await wg.scale(n)
    741 for instance in self._instances:
    742     if instance.name == self.name:

File /opt/conda/lib/python3.10/site-packages/kr8s/_objects.py:307, in APIObject.scale(self, replicas)
    305 if not self.scalable:
    306     raise NotImplementedError(f"{self.kind} is not scalable")
--> 307 await self._exists(ensure=True)
    308 await self._patch({"spec": dot_to_nested_dict(self.scalable_spec, replicas)})
    309 while self.replicas != replicas:

File /opt/conda/lib/python3.10/site-packages/kr8s/_objects.py:227, in APIObject._exists(self, ensure)
    225     return True
    226 if ensure:
--> 227     raise NotFoundError(f"Object {self.name} does not exist")
    228 return False

**NotFoundError: Object my-dask-cluster-default does not exist**

Environment:

  • Dask version: 2023.5.0
  • Python version:
  • Operating System:
  • Install method (conda, pip, source): pip
@capoolebugchat
Author

capoolebugchat commented Aug 23, 2024

UPDATE: as a Hail Mary I shut down the Jupyter kernel and this resolved itself. I won't be closing this for now, since I think this could be a problem with freshly installed Dask setups.

UPDATE: it came back; restarting the kernel is not a workaround...

@jacobtomlinson
Member

jacobtomlinson commented Aug 23, 2024

OK, just to check: you can successfully create Dask clusters in the same namespace that your Jupyter pod is running in, but when you try to create clusters in other namespaces it fails?

My guess is that there is a bug causing the "dask-jobs-ns" namespace setting to be dropped somewhere and it's trying to look up something in the current namespace and failing.

To test this hypothesis, could you change your default namespace to dask-jobs-ns, restart the kernel, and try again? My expectation is that creating a cluster in dask-jobs-ns will then work, but creating one in the original (Jupyter pod's) namespace will no longer work.

kubectl config set-context dask-jobs --namespace=dask-jobs-ns
kubectl config use-context dask-jobs
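
Something like this on the Python side should confirm it (a rough sketch only; the cluster names and the "jupyter-ns" placeholder below are illustrative, not taken from your setup):

from dask_kubernetes.operator import KubeCluster

# With the kubectl default namespace now pointing at dask-jobs-ns,
# omit the namespace argument so the client falls back to the default.
cluster_a = KubeCluster(name="test-default-ns")
cluster_a.scale(2)  # expected to work if the hypothesis holds

# Explicitly target the namespace the Jupyter pod runs in
# ("jupyter-ns" is a placeholder for your actual namespace).
cluster_b = KubeCluster(name="test-other-ns", namespace="jupyter-ns")
cluster_b.scale(2)  # expected to fail if the namespace setting is being dropped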
