dask-kubernetes-operator-role-cluster clusterrole does not have the needed ACL against pods/portforward resource #909

Open
oe-hbk opened this issue Oct 9, 2024 · 3 comments

oe-hbk commented Oct 9, 2024

Describe the issue:
The dask-kubernetes-operator pod logs a 403 Forbidden error when calling the Kubernetes API. It does not appear to have the required cluster role permissions:

[2024-10-08 21:48:24,704] httpx                [INFO    ] HTTP Request: GET https://10.233.0.1/api/v1/namespaces/MYNAMESPACE/pods/MYPOD/portforward?name=MYPOD&namespace=MYNAMESPACE&ports=80&_preload_content=false " HTTP/1.1 403 Forbidden"

Exec'ing into the pod and making the same call against the API reproduces the 403:

kubectl exec -it -n dask-system dask-kubernetes-operator-78d4b784cf-4r455 -- sh

$ SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
$ NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
$ TOKEN=$(cat ${SERVICEACCOUNT}/token)
$ CACERT=${SERVICEACCOUNT}/ca.crt
$ curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET 'https://10.233.0.1/api/v1/namespaces/MYNAMESPACE/pods/MYPOD/portforward?name=MYPOD&namespace=MYNAMESPACE&ports=80&_preload_content=false'
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "pods \"MYPOD\" is forbidden: User \"system:serviceaccount:dask-system:dask-kubernetes-operator
\" cannot get resource \"pods/portforward\" in API group \"\" in the namespace \"MYNAMESPACE\"",
  "reason": "Forbidden",
  "details": {
    "name": "MYPOD",
    "kind": "pods"
  },
  "code": 403
}
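
A quick way to confirm (and later re-check) the permission is kubectl auth can-i against the portforward subresource, using the service account and namespace from the message above. This is just a suggested check, not part of the original report; it should answer "no" until the clusterrole is fixed:

$ kubectl auth can-i get pods --subresource=portforward \
    --as=system:serviceaccount:dask-system:dask-kubernetes-operator -n MYNAMESPACE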

Editing the clusterrole,

$ kubectl edit clusterrole -n dask-system dask-kubernetes-operator-role-cluster

adding pods/portforward alongside the existing pods entries in its rules, and then restarting the application pod corrected the problem.
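
For anyone who prefers to script the change rather than use kubectl edit, a minimal sketch is below. It assumes the rule listing the pods resources is the first entry (/rules/0) in the clusterrole, so verify the index before patching; the deployment name is inferred from the pod name shown above.

$ kubectl get clusterrole dask-kubernetes-operator-role-cluster -o yaml   # find the rule that lists "pods"
$ kubectl patch clusterrole dask-kubernetes-operator-role-cluster --type=json \
    -p='[{"op": "add", "path": "/rules/0/resources/-", "value": "pods/portforward"}]'
$ kubectl rollout restart -n dask-system deployment/dask-kubernetes-operator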

Environment:

  • Dask version: dask-kubernetes-operator-2024.5.0
  • Python version:
  • Operating System: Rocky 8
  • Install method (conda, pip, source): helm chart
@jacobtomlinson
Member

Thanks for raising this. I wouldn't necessarily expect the controller Pod to be opening port forwards to the scheduler Pods, so there may be a deeper issue going on. Generally the controller will attempt to connect directly to the scheduler Pod, and that may be failing for some reason and so it is falling back to a port forward.

Could you check your logs for other failing connection messages?
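
(Not part of the original exchange, but one way to surface such messages is to grep the controller logs; the deployment name follows the pod name above and the keywords are only a guess at what is relevant.)

$ kubectl logs -n dask-system deployment/dask-kubernetes-operator --since=24h \
    | grep -iE "error|disconnect|refused|timeout"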

oe-hbk commented Oct 9, 2024

Thanks @jacobtomlinson.

The following was also seen in the operator pod log:

[2024-10-08 21:46:04,848] kopf.objects         [ERROR   ] [MYNAMESPACE/MYPOD_SHORTNAME_autoscaler] Timer 'daskautoscaler_adapt' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 812, in daskautoscaler_adapt
    desired_workers = await get_desired_workers(
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 520, in get_desired_workers
    async with session.get(url) as resp:
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client.py", line 1197, in __aenter__
    self._resp = await self._coro
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client.py", line 608, in _request
    await resp.start(conn)
  File "/usr/local/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 976, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/usr/local/lib/python3.10/site-packages/aiohttp/streams.py", line 640, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected

@jacobtomlinson
Member

Yeah I'm not surprised by that one. We have three levels of fallback when communicating with the scheduler:

  • HTTP request to the scheduler dashboard (this is tried first but often disabled by default and results in the aiohttp error above)
  • Open an RPC to the scheduler Pod directly
  • Open a port-forward and connect the RPC over that connection

Your initial message is failing on that last step. But I'm curious why the middle step is failing at all.
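
(A hedged way to test that middle step is to probe the scheduler's dashboard and comm ports from inside the controller pod. The service name below is a placeholder following the usual <cluster-name>-scheduler convention, and 8787/8786 are the Dask defaults; adjust both to your deployment.)

# Can the controller reach the scheduler dashboard (HTTP, port 8787)?
$ kubectl exec -n dask-system deploy/dask-kubernetes-operator -- \
    curl -sv --max-time 5 http://MYCLUSTER-scheduler.MYNAMESPACE.svc.cluster.local:8787/

# Can it open a plain TCP connection to the scheduler comm port (8786)?
$ kubectl exec -n dask-system deploy/dask-kubernetes-operator -- \
    curl -sv --max-time 5 telnet://MYCLUSTER-scheduler.MYNAMESPACE.svc.cluster.local:8786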
