Hi gang. I just managed to install the Dask Operator on a K8s cluster with the necessary Dask CRDs. To try some things out I decided to use Jupyter: I stood up my own JupyterLab server in a namespace, set up a service account for it, made sure it worked, and finally installed Dask and dask-kubernetes.
I tried creating a toy Dask cluster and it worked.
I then replicated the whole YAML setup, replacing only the namespace, in order to access some namespace-bound resources. In the new namespace the scheduler pod spawned fine, but when it came to scaling the workers the Jupyter output showed a really strange error (below).
I've exhausted my Google searches and nothing came up. I hope you can point me in the right direction.
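For reference, a minimal reproduction looks roughly like this (a sketch reconstructed from the cell shown in the traceback; the exact import path for CreateMode is my assumption, based on the module paths in the traceback):

```python
from dask_kubernetes.operator import KubeCluster
from dask_kubernetes.operator.kubecluster.kubecluster import CreateMode

# The scheduler pod is created fine in the target namespace...
cluster = KubeCluster(
    name="my-dask-cluster",
    namespace="dask-jobs-ns",  # not the namespace the Jupyter pod runs in
    create_mode=CreateMode.CREATE_ONLY,
)

# ...but scaling the workers raises NotFoundError (full traceback below)
cluster.scale(2)
```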
```
NotFoundError Traceback (most recent call last)
Cell In[85], line 5
3 # KubeCluster?
4 cluster = KubeCluster(name="my-dask-cluster", namespace="dask-jobs-ns", create_mode=CreateMode.CREATE_ONLY)
----> 5 cluster.scale(2)
File /opt/conda/lib/python3.10/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py:729, in KubeCluster.scale(self, n, worker_group)
713 def scale(self, n, worker_group="default"):
714 """Scale cluster to n workers
715
716 Parameters
(...)
726 >>> cluster.scale(7, worker_group="high-mem-workers") # scale worker group high-mem-workers to seven workers
727 """
--> 729 return self.sync(self._scale, n, worker_group)
File /opt/conda/lib/python3.10/site-packages/distributed/utils.py:358, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
356 return future
357 else:
--> 358 return sync(
359 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
360 )
File /opt/conda/lib/python3.10/site-packages/distributed/utils.py:434, in sync(loop, func, callback_timeout, *args, **kwargs)
431 wait(10)
433 if error is not None:
--> 434 raise error
435 else:
436 return result
File /opt/conda/lib/python3.10/site-packages/distributed/utils.py:408, in sync.<locals>.f()
406 awaitable = wait_for(awaitable, timeout)
407 future = asyncio.ensure_future(awaitable)
--> 408 result = yield future
409 except Exception as exception:
410 error = exception
File /opt/conda/lib/python3.10/site-packages/tornado/gen.py:767, in Runner.run(self)
765 try:
766 try:
--> 767 value = future.result()
768 except Exception as e:
769 # Save the exception for later. It's important that
770 # gen.throw() not be called inside this try/except block
771 # because that makes sys.exc_info behave unexpectedly.
772 exc: Optional[Exception] = e
File /opt/conda/lib/python3.10/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py:740, in KubeCluster._scale(self, n, worker_group)
735 await autoscaler.delete()
737 wg = await DaskWorkerGroup(
738 f"{self.name}-{worker_group}", namespace=self.namespace
739 )
--> 740 await wg.scale(n)
741 for instance in self._instances:
742 if instance.name == self.name:
File /opt/conda/lib/python3.10/site-packages/kr8s/_objects.py:307, in APIObject.scale(self, replicas)
305 if not self.scalable:
306 raise NotImplementedError(f"{self.kind} is not scalable")
--> 307 await self._exists(ensure=True)
308 await self._patch({"spec": dot_to_nested_dict(self.scalable_spec, replicas)})
309 while self.replicas != replicas:
File /opt/conda/lib/python3.10/site-packages/kr8s/_objects.py:227, in APIObject._exists(self, ensure)
225 return True
226 if ensure:
--> 227 raise NotFoundError(f"Object {self.name} does not exist")
228 return False
NotFoundError: Object my-dask-cluster-default does not exist
```
Environment:
Dask version: 2023.5.0
Python version:
Operating System:
Install method (conda, pip, source): pip
UPDATE: as a Hail Mary I shut down the Jupyter kernel, and that resolved it. I won't be closing this for now, since I think it could be a problem with freshly installed Dask setups.
UPDATE: it came back; restarting the kernel is not a workaround...
OK, so just to check: you can successfully create Dask clusters in the same namespace that your Jupyter pod is running in, but when you try to create clusters in other namespaces it fails?
My guess is that there is a bug causing the "dask-jobs-ns" namespace setting to be dropped somewhere, so it ends up looking something up in the current namespace and failing.
To test this hypothesis, could you change your default namespace to dask-jobs-ns, restart the kernel, and try again? My expectation is that creating a cluster in dask-jobs-ns will then work, but creating one in the current namespace will no longer work.
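As a rough sketch of how to check which namespace is being picked up as the default (assuming kr8s's public api() helper here, since the traceback shows kr8s underneath dask-kubernetes; switching your kubeconfig default can be done with e.g. `kubectl config set-context --current --namespace=dask-jobs-ns`):

```python
import kr8s

# The namespace kr8s falls back to when no explicit one is given.
# If this prints something other than "dask-jobs-ns", the lookup of
# my-dask-cluster-default would happen in the wrong namespace, which
# would explain the NotFoundError above.
api = kr8s.api()
print(api.namespace)
```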