Culler not working when there are too many users? #88
After looking into the issue I found the problem. During pagination, `req.url = next_info["url"]`, which throws an error during fetching:

```
[I 250123 20:02:44 __init__:156] Fetching page 2 https://<my_domain>/hub/api/users?state=ready&offset=50&limit=50
[E 250123 20:02:44 ioloop:770] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0xf73e1c04f580>>, <Task finished name='Task-1' coro=<cull_idle() done, defined at /usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py:78> exception=ConnectionRefusedError(111, 'Connection refused')>)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 750, in _run_callback
    ret = callback()
  File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 774, in _discard_future_result
    future.result()
  File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 436, in cull_idle
    async for user in fetch_paginated(req):
  File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 142, in fetch_paginated
    response = await resp_future
  File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 124, in fetch
    return await client.fetch(req)
ConnectionRefusedError: [Errno 111] Connection refused
```

A dummy fix for my use case would be:

```python
req.url = next_info["url"].replace(
    "<my_url>",
    "<my_url>:5081",
)
```

Not ideal, but I was wondering: is this something JupyterHub should return properly, or is the problem in the culler? I believe it could be JupyterHub, since if we have the hub on `https://<my_domain>:5081`, the first call would include the port, and the pagination would return `https://<my_domain>/hub/api/users?state=ready&offset=50&limit=50` (without the port) as `next`.
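A less brittle variant of that workaround would be to rebuild the next-page URL from the originally configured one, rather than string-replacing the domain. A minimal sketch, reusing the `req`/`next_info` names from `fetch_paginated` above; `merge_next_url` is a hypothetical helper, not part of jupyterhub-idle-culler:

```python
from urllib.parse import urlparse, urlunparse

def merge_next_url(configured_url: str, next_url: str) -> str:
    """Keep scheme/host/port from the configured hub API URL;
    take path and query from the hub's paginated `next` URL."""
    base = urlparse(configured_url)  # e.g. https://<my_domain>:5081/hub/api/users?...
    nxt = urlparse(next_url)         # e.g. https://<my_domain>/hub/api/users?offset=50&limit=50
    return urlunparse((base.scheme, base.netloc, nxt.path, nxt.params, nxt.query, nxt.fragment))

# In fetch_paginated, instead of `req.url = next_info["url"]`:
# req.url = merge_next_url(req.url, next_info["url"])
```

This keeps the port (and scheme) from the URL passed via `--url`, so it would still work even if the hub reports its public URL without the internal port.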
Please can you tell us your Z2JH version, and share your full configuration?
We are using Z2JH version 2.0.0. As for our configuration:
For our culler config:
Installed versions that I believe matter:
Bug description
The culler is not culling, despite the last activity clearly exceeding the maximum idle time allowed.
Current setup, from our k8s logs:

```
Starting service 'cull-idle': ['python3', '-m', 'jupyterhub_idle_culler', '--url=https://<url>:5081/hub/api', '--timeout=14400', '--cull-every=600', '--concurrency=10']
```
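For context, a hedged sketch of how a `cull-idle` service with these flags is typically registered in `jupyterhub_config.py`; Z2JH generates something equivalent from its `cull` values, so the actual generated config may differ:

```python
# jupyterhub_config.py - illustrative sketch, not the Z2JH-generated config
c.JupyterHub.services = [
    {
        "name": "cull-idle",
        "command": [
            "python3", "-m", "jupyterhub_idle_culler",
            "--url=https://<url>:5081/hub/api",
            "--timeout=14400",    # cull servers idle for more than 4 hours
            "--cull-every=600",   # run the check every 10 minutes
            "--concurrency=10",   # limit concurrent API requests
        ],
    },
]
# With JupyterHub >= 2.0, the culler also needs a role granting these scopes:
c.JupyterHub.load_roles = [
    {
        "name": "idle-culler",
        "scopes": ["list:users", "read:users:activity", "read:servers", "delete:servers"],
        "services": ["cull-idle"],
    },
]
```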
We observe two things:
In the second case the `last_activity` was over 3 days ago, which is way more than the timeout of 14400 seconds (4 hours).
Example obtained from `/api/users/<user>`:
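For comparison, a quick way to pull such a user model yourself; a sketch, with `<url>`, `<user>`, and the token as placeholders:

```python
import json
import urllib.request

req = urllib.request.Request(
    "https://<url>:5081/hub/api/users/<user>",
    headers={"Authorization": "token <api_token>"},  # token needs read access to users
)
with urllib.request.urlopen(req) as resp:
    user = json.load(resp)

# `last_activity` is the timestamp the culler compares against --timeout
print(user["last_activity"])
```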
How to reproduce
I don't know how to reproduce this. We have JupyterHub deployed in several datacenters, but it only fails in our busiest one, with over 150 servers running simultaneously (not sure if that's relevant).
I'm afraid the service is blocked / not running as expected because there are too many users.
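One way to test that hypothesis directly (a sketch, assuming an API token that can list users; the pagination media type is the one the culler itself requests) is to ask the hub for the paginated user list and inspect the `next` URL it hands back:

```python
import json
import urllib.request

req = urllib.request.Request(
    "https://<url>:5081/hub/api/users?state=ready&limit=50",
    headers={
        "Authorization": "token <api_token>",
        "Accept": "application/jupyterhub-pagination+json",  # opt in to paginated responses
    },
)
with urllib.request.urlopen(req) as resp:
    page = json.load(resp)

nxt = page["_pagination"]["next"]
# With more than 50 ready servers this should be non-null; check whether
# its URL still carries the :5081 port the culler was configured with.
print(nxt and nxt["url"])
```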
Expected behaviour
To cull the servers after they time out.
Actual behaviour
It is not culling the servers after the timeout.
Your personal set up
Latest versions of JupyterHub and jupyterhub-idle-culler.
Logs
I can see the following logs from the culler:
⚠ Not sure if it's important, but the "Fetching page 2" log doesn't show the port, just the URL.
Sometimes I also get: