
[Bug]: Repeated clicking on the same file eventually leads to server "Too many open files" error and session becomes unusable #4538

Closed
1 task done
diwu-sf opened this issue Oct 24, 2024 · 15 comments
Assignees
Labels
bug Something isn't working severity:medium Affecting multiple users Stale Inactive for 30 days

Comments

@diwu-sf
Contributor

diwu-sf commented Oct 24, 2024

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Describe the bug and reproduction steps

Note: I searched for the error message "Too many open files" and didn't see any open issue for this error.

Repro steps:

  1. Run with a prompt of do nothing
  2. Add files to the /workspace, just a few files will do
  3. In the Workspace UI, click open one of the files
  4. Repeatedly click on the file quickly over and over again

Eventually, the UI and backend server become broken.
In the UI, the message "Failed to fetch file" shows up.
In the backend, when running with DEBUG=1 make run, this error message shows up:

Traceback (most recent call last):
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/starlette/middleware/base.py", line 185, in __call__
    with collapse_excgroups():
  File "/Users/diwu/miniconda3/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/starlette/_utils.py", line 82, in collapse_excgroups
    raise exc
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/starlette/middleware/base.py", line 187, in __call__
    response = await self.dispatch_func(request, call_next)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "openhands/server/listen.py", line 217, in attach_session
    request.state.conversation = await call_sync_from_async(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "openhands/utils/async_utils.py", line 18, in call_sync_from_async
    result = await coro
             ^^^^^^^^^^
  File "/Users/diwu/miniconda3/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "openhands/utils/async_utils.py", line 17, in <lambda>
    coro = loop.run_in_executor(None, lambda: fn(*args, **kwargs))
                                              ^^^^^^^^^^^^^^^^^^^
  File "openhands/server/session/manager.py", line 52, in attach_to_conversation
    return Conversation(sid, file_store=self.file_store, config=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "openhands/server/session/conversation.py", line 31, in __init__
    self.runtime = runtime_cls(
                   ^^^^^^^^^^^^
  File "openhands/runtime/impl/eventstream/eventstream_runtime.py", line 162, in __init__
    self.docker_client: docker.DockerClient = self._init_docker_client()
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "openhands/runtime/impl/eventstream/eventstream_runtime.py", line 241, in _init_docker_client
    raise ex
  File "openhands/runtime/impl/eventstream/eventstream_runtime.py", line 236, in _init_docker_client
    return docker.from_env()
           ^^^^^^^^^^^^^^^^^
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/docker/client.py", line 94, in from_env
    return cls(
           ^^^^
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/docker/client.py", line 45, in __init__
    self.api = APIClient(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/docker/api/client.py", line 207, in __init__
    self._version = self._retrieve_server_version()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/docker/api/client.py", line 230, in _retrieve_server_version
    raise DockerException(

OpenHands Installation

Docker command in README

OpenHands Version

main

Operating System

MacOS

Logs, Errors, Screenshots, and Additional Context

No response

@diwu-sf diwu-sf added the bug Something isn't working label Oct 24, 2024
@diwu-sf diwu-sf changed the title [Bug]: Repeated clicking on the same file eventually leads to server "Too many open files" error [Bug]: Repeated clicking on the same file eventually leads to server "Too many open files" error and session becomes unusable Oct 24, 2024
@diwu-sf
Contributor Author

diwu-sf commented Oct 24, 2024

I did some debugging into this; it appears that many Unix sockets to the Docker daemon socket are being opened per "click". There appears to be a leak somewhere.

@diwu-sf
Contributor Author

diwu-sf commented Oct 24, 2024

Every time the Docker container log is retrieved, a bunch of Unix socket FDs are leaked. The issue is the same as what's reported in docker/docker-py#3282

As far as I can tell, the best way to fix this would be to:

  • stop using container.logs(stream=True, follow=True), because the bug is in the generator
  • keep a timestamp of the last log-retrieval time
  • poll the log each time from since to until
  • get rid of the extra background thread; just poll the log in [since -> until] segments every time get_and_clear is called
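The polling approach above could be sketched roughly as follows. `PollingLogBuffer` is a hypothetical name mirroring this discussion, not the actual OpenHands class, and the real code may differ:

```python
import time


class PollingLogBuffer:
    """Fetch container logs in [since -> until] windows.

    Unlike container.logs(stream=True, follow=True), each poll is a
    one-shot request whose socket is closed when the call returns,
    so no FD is left dangling per retrieval.
    """

    def __init__(self, container):
        self.container = container
        self._since = int(time.time())

    def get_and_clear(self) -> str:
        until = int(time.time())
        # docker-py's Container.logs accepts integer `since`/`until`
        # Unix timestamps, so each call covers exactly one window.
        data = self.container.logs(since=self._since, until=until)
        self._since = until
        return data.decode("utf-8", errors="replace")
```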

The other leak, of TCP sockets, comes from requests.session() returning responses that are never explicitly closed, so they accumulate as leaked connections. Solutions are:

  • either refactor the way responses are consumed to always call close() so the connection is returned to the pool (looks like a lot of refactoring; not worth it IMO)
  • or set session.headers['Connection'] = 'close' to automatically close the connection (this worked with minimal change; I'd just fix it this way, and it doesn't matter perf-wise since the TCP connection to the sandbox doesn't get reused)
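The `Connection: close` workaround is essentially a one-liner; a minimal sketch (the helper name and session construction here are illustrative, not the actual OpenHands code):

```python
import requests


def make_sandbox_session() -> requests.Session:
    """Build a session whose HTTP connections close after each request.

    Without this, responses that are never explicitly close()'d can keep
    their pooled TCP sockets half-open (CLOSE_WAIT) until the process
    runs out of file descriptors.
    """
    session = requests.Session()
    # Sent on every request made through this session.
    session.headers["Connection"] = "close"
    return session
```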

@mamoodi mamoodi added the severity:medium Affecting multiple users label Oct 24, 2024
@tofarr tofarr self-assigned this Oct 24, 2024
@tofarr
Collaborator

tofarr commented Oct 24, 2024

This issue should now be resolved on main

@diwu-sf
Contributor Author

diwu-sf commented Oct 24, 2024

The bug is less severe, but it's still fundamentally there. Take a look at the output of lsof -p {pid of the uvicorn process}:

python3.1 12305 diwu   98u    IPv6 0xcc4fb2f8cb35a3a9       0t0                 TCP localhost:54187->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu   99u    unix 0xcc4fb3072e10a861       0t0                     ->0xcc4fb3072e10a929
python3.1 12305 diwu  100u    IPv6 0xcc4fb2f8cb35b1a9       0t0                 TCP localhost:54191->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  101u    unix 0xcc4fb3072e1083a9       0t0                     ->0xcc4fb3072e108471
python3.1 12305 diwu  102u    IPv6 0xcc4fb2f8cb35fea9       0t0                 TCP localhost:54194->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  103u    unix 0xcc4fb3072e1089e9       0t0                     ->0xcc4fb3072e108ab1
python3.1 12305 diwu  104u    IPv6 0xcc4fb2f8cb3605a9       0t0                 TCP localhost:54198->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  105u    unix 0xcc4fb3072e108e99       0t0                     ->0xcc4fb3072e109411
python3.1 12305 diwu  106u    IPv6 0xcc4fb2f8cb360ca9       0t0                 TCP localhost:54202->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  107u    unix 0xcc4fb3072e108219       0t0                     ->0xcc4fb3072e1082e1
python3.1 12305 diwu  108u    IPv6 0xcc4fb2f8cb35d4a9       0t0                 TCP localhost:54205->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  109u    unix 0xcc4fb3072e108859       0t0                     ->0xcc4fb3072e108921
python3.1 12305 diwu  110u    IPv6 0xcc4fb2f8cb35e2a9       0t0                 TCP localhost:54209->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  111u    unix 0xcc4fb3072e108b79       0t0                     ->0xcc4fb3072e108c41
python3.1 12305 diwu  112u    IPv6 0xcc4fb2f8cb35f0a9       0t0                 TCP localhost:54213->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  113u    unix 0xcc4fb3072e109fc9       0t0                     ->0xcc4fb3072e10a091
python3.1 12305 diwu  114u    IPv6 0xcc4fb2f8cb3595a9       0t0                 TCP localhost:54262->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu  115u    unix 0xcc4fb3072e1097f9       0t0                     ->0xcc4fb3072e109a51
python3.1 12305 diwu  116u    IPv6 0xcc4fb2f8cb359ca9       0t0                 TCP localhost:54267->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu  117u    unix 0xcc4fb3072e108791       0t0                     ->0xcc4fb3072e108d09
python3.1 12305 diwu  118u    IPv6 0xcc4fb2f8cb3579a9       0t0                 TCP localhost:54270->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu  119u    unix 0xcc4fb3072e108601       0t0                     ->0xcc4fb3072e1086c9
python3.1 12305 diwu  120u    IPv6 0xcc4fb2f8cb3533a9       0t0                 TCP localhost:54274->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu  121u    unix 0xcc4fb3072e10a3b1       0t0                     ->0xcc4fb3072e10a479

The TCP CLOSE_WAIT leak is the easy one to solve, by setting the Connection header to close.
The other, Unix-socket leak is still the log-tailing issue.

@tofarr tofarr mentioned this issue Oct 25, 2024
1 task
@diwu-sf
Contributor Author

diwu-sf commented Oct 26, 2024

The TCP sockets are no longer being leaked, so the only remaining accumulating leak is the Unix sockets to the Docker daemon.
You can repro it by repeatedly clicking on a file and then running lsof -p ... | grep unix

@mamoodi
Collaborator

mamoodi commented Nov 11, 2024

@diwu-sf is this bug finally resolved? tofarr put in a few fixes.

@diwu-sf
Contributor Author

diwu-sf commented Nov 11, 2024

No, there's still a Unix-socket leak to the Docker socket due to the log streamer.

Use the same repro and you should see that Unix sockets still accumulate per click.

@rbren
Collaborator

rbren commented Nov 14, 2024

I was able to see the leakage by following lsof -p for this process and clicking on files:

~/.cache/pypoetry/virtualenvs/openhands-ai-uYxnY0EM-py3.12/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=5, pipe_handle=7) --multiprocessing-fork

Strange that the uvicorn process itself doesn't own the leak, just this one, which I guess points to the leak being in a thread?

@diwu-sf
Contributor Author

diwu-sf commented Nov 14, 2024 via email

@rbren
Collaborator

rbren commented Nov 14, 2024

OK I fixed several leakage issues, but this one persists 🙃

I did start calling log_generator.close() but no luck

For posterity, you can watch the leakage with:

while true; do lsof -p $(pgrep -f "tracker_fd") | wc -l; sleep 1; done
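An in-process equivalent of that lsof loop, for asserting in a test that repeated log retrievals don't grow the FD count (Linux-only, since it reads /proc; on macOS you'd stick with lsof):

```python
import os


def open_fd_count(pid=None):
    """Count a process's open file descriptors via /proc/<pid>/fd.

    Defaults to the current process; pass the PID of the forked
    worker to watch it instead.
    """
    pid = pid if pid is not None else os.getpid()
    return len(os.listdir(f"/proc/{pid}/fd"))
```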

All the leaks go away if you make LogBuffer a null class (replace all method logic with pass or return empty values).

@rbren
Collaborator

rbren commented Nov 14, 2024

I've further confirmed that if you don't instantiate the log_generator in LogBuffer, the leak goes away.

I've also confirmed that we're closing every log_generator we create.

TBH, at this point I'm assuming there's a bug in the Docker SDK

@diwu-sf
Contributor Author

diwu-sf commented Nov 14, 2024 via email


This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale Inactive for 30 days label Dec 15, 2024

This issue was closed because it has been stalled for over 30 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 23, 2024
@BIJOY-SUST

BIJOY-SUST commented Jan 26, 2025

I am trying to run inference on swe-bench-lite and hit a "Too many open files" error.

Using OpenHands version 0.21.0:

    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', OSError(24, 'Too many open files'))
Exception ignored in atexit callback: <bound method DockerRuntime.close of <openhands.runtime.impl.docker.docker_runtime.DockerRuntime object at 0x7ff4c5edaff0>>
Traceback (most recent call last):
  File "project/openhands/runtime/impl/docker/docker_runtime.py", line 371, in close
    stop_all_containers(close_prefix)
  File "project/openhands/runtime/impl/docker/containers.py", line 7, in stop_all_containers
    containers = docker_client.containers.list(all=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.cache/pypoetry/virtualenvs/openhands-ai-xxSbwZmD-py3.12/lib/python3.12/site-packages/docker/models/containers.py", line 1018, in list
    containers.append(self.get(r['Id']))
                      ^^^^^^^^^^^^^^^^^
  File "/project/.cache/pypoetry/virtualenvs/openhands-ai-xxSbwZmD-py3.12/lib/python3.12/site-packages/docker/models/containers.py", line 954, in get
    resp = self.client.api.inspect_container(container_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.cache/pypoetry/virtualenvs/openhands-ai-xxSbwZmD-py3.12/lib/python3.12/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.cache/pypoetry/virtualenvs/openhands-ai-xxSbwZmD-py3.12/lib/python3.12/site-packages/docker/api/container.py", line 794, in inspect_container
    self._get(self._url("/containers/{0}/json", container)), True
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.cache/pypoetry/virtualenvs/openhands-ai-xxSbwZmD-py3.12/lib/python3.12/site-packages/docker/utils/decorators.py", line 44, in inner
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.cache/pypoetry/virtualenvs/openhands-ai-xxSbwZmD-py3.12/lib/python3.12/site-packages/docker/api/client.py", line 246, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.cache/pypoetry/virtualenvs/openhands-ai-xxSbwZmD-py3.12/lib/python3.12/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.cache/pypoetry/virtualenvs/openhands-ai-xxSbwZmD-py3.12/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.cache/pypoetry/virtualenvs/openhands-ai-xxSbwZmD-py3.12/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/project/.cache/pypoetry/virtualenvs/openhands-ai-xxSbwZmD-py3.12/lib/python3.12/site-packages/requests/adapters.py", line 682, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', OSError(24, 'Too many open files'))
Exception ignored in atexit callback: <function stop_all_runtime_containers at 0x7ff5763272e0>
Traceback (most recent call last):
