Operator unable to delete Kubernetes Deployment #910
Comments
To provide some extra information: it seems the operator tries three times to find out which worker/deployment to remove:
From the logs, we see that the first two attempts failed, which was a bit unexpected given that the operator can scale up the workers.
We could see some 404s in the response body (it would be useful to see which request they belonged to). After digging through the issues here, #807 shed some light on adding
We then recreated the scheduler, and it looks like the first HTTP call to get the workers to retire returned the right name. That name seems to be the value of the env var DASK_WORKER_NAME, since the dashboard shows the workers under that name, i.e. matching the deployment name. The workers are then removed after all tasks have been computed:
Is this setting |
With the default settings from the docs, the Operator tries to delete a Kubernetes Deployment using the wrong name and therefore cannot find it. The Operator tries to delete a Deployment named after the worker Pod, and no Deployment with that name exists.
Reproducing steps:
helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name dask-kubernetes-operator
(i.e. this quick start step).

If I check the pods, the name `simple-default-worker-057ae426b6-79bcbdb84b-vlcn7` of the deployment it tried to delete does indeed exist, but as a worker pod:

However, the deployment that controls this pod has a different name:

As you can see, the deployment that controls that worker pod is actually named `simple-default-worker-057ae426b6` instead of `simple-default-worker-057ae426b6-79bcbdb84b-vlcn7`. As a result, the operator is unable to delete the deployments and the workers are never removed from the namespace. It could be coming from this line here, where the deletion uses the worker name as the expected Deployment name.

Anything else we need to know?:
This may be related to #855
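
To illustrate the naming mismatch: pods created by a Deployment are normally named `<deployment>-<pod-template-hash>-<random suffix>`, so the Deployment name can usually be recovered by dropping the last two segments. The snippet below is only a sketch of that relationship, not the operator's code. `deployment_name_from_pod` is a hypothetical helper, and it assumes the name has not been truncated by Kubernetes. Following the pod's `ownerReferences` would be more robust than parsing names.

```python
def deployment_name_from_pod(pod_name: str) -> str:
    """Guess the owning Deployment name from a Deployment-managed pod name.

    Assumption: the pod name follows the usual
    `<deployment>-<pod-template-hash>-<suffix>` layout.
    """
    # Split off the two trailing segments (ReplicaSet hash and pod suffix).
    parts = pod_name.rsplit("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected pod name layout: {pod_name!r}")
    return parts[0]


# Names taken from this issue.
pod = "simple-default-worker-057ae426b6-79bcbdb84b-vlcn7"
print(deployment_name_from_pod(pod))  # simple-default-worker-057ae426b6
```

The printed value is the Deployment that actually exists in the namespace, whereas the full pod name is what the operator tried to delete.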
Environment: