S3 access results in improperly terminated TCP sockets. #3193
Thanks for opening an issue. Are you sure this is an issue with the Ruby SDK and not your application/host configuration? If it's not an issue locally when using the SDK, it is likely not the library code. The SDK uses Net::HTTP to make requests and we use a connection pool to re-use connections.
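For reference, the pool described here can be observed from a console. A minimal sketch, assuming the SDK's internal Seahorse::Client::NetHttp::ConnectionPool API (not a public interface, so names may differ across SDK versions) and placeholder region/bucket values:

```ruby
require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')  # placeholder region
s3.list_objects_v2(bucket: 'example-bucket')   # placeholder bucket

# The SDK keeps Net::HTTP sessions in per-process pools and re-uses idle
# sessions for subsequent requests to the same endpoint.
Seahorse::Client::NetHttp::ConnectionPool.pools.each do |pool|
  puts pool.size  # idle sessions currently checked into this pool
end
```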
Thanks for the quick reply! When I said it wasn't an issue locally, I just meant that there is very little to no impact locally from having TCP sockets left in CLOSE_WAIT, but in a multi-process production environment, it very much is a problem. Connection pools only work within-process, for a multi-threaded model. In a multi-process paradigm, connection pools don't help.
When I attempt your steps locally on my machine, I don't see the same behavior. I see one connection with lsof, then I make a call to s3 get object, and see the same connection again in lsof. Where in the SDK do you think this is occurring? To be frank, I don't think this is something we can change, it sounds like it's behavior with Net::HTTP or your configuration.
I'm not sure what you're referring to when you say 'your configuration'. Do you mean the S3 configuration? Because that is the only constant between the application in production and my local tests using IRB and my local tests running a brand new Rails 8 app with active-storage. There is nothing I'm configuring in my IRB test; I'm simply loading the aws-sdk-s3 gem and listing the keys in a bucket in our S3. Keeping TCP connections in this state is not the default behavior of Net::HTTP, so I don't know exactly where in the SDK this would be happening, but I assume the SDK is making these sockets, and would therefore be responsible for closing them properly.
When I say configuration it could mean anything - I don't know much about your system. It could be your host, how you've configured processes with passenger/nginx, etc. You can check the connection_pool code here:
Right, I'm saying I tested this on multiple systems in multiple ways, so it very likely does not have anything to do with a system configuration. I'll look into the idle timeout and see if I can influence anything with that, thanks!
Looks like the default value for idle timeout is 5 seconds, but these connections stick around forever as long as the process that initiated them is still running. Something is preventing these from fully closing and it is not due to a configuration setting. |
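For readers following along, the idle timeout mentioned here corresponds to the client's :http_idle_timeout option (default 5 seconds). A sketch of overriding it, with a placeholder region; note that, as discussed later in the thread, expired sessions only appear to be swept when a subsequent request runs, so lowering this may not help an idle process:

```ruby
require 'aws-sdk-s3'

# :http_idle_timeout (default 5) controls how long a pooled connection may sit
# idle before being treated as stale. Expired sessions are only swept when a
# later request runs, so an idle process can still hold a CLOSE_WAIT socket.
s3 = Aws::S3::Client.new(
  region: 'us-east-1',  # placeholder region
  http_idle_timeout: 1  # seconds
)
```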
Are your calls to s3 very spiky?
Spiky, as in mostly low traffic with intermittent periods of very high-volume access? In my simplest test case, I just read the keys from a bucket one time, so no traffic to speak of, really. That link is not pointing to a method, nor does the […]
Sorry, updated. I was referring to empty_connection_pools!
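For context, empty_connection_pools! is a module method on Aws that closes every session the SDK has pooled in the current process. A minimal usage sketch, with placeholder region and bucket:

```ruby
require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')  # placeholder region
s3.list_objects_v2(bucket: 'example-bucket')   # placeholder bucket

# Finishes and discards every Net::HTTP session the SDK has pooled in this
# process, which also clears the CLOSE_WAIT socket described in this thread.
Aws.empty_connection_pools!
```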
Thanks, that does work as far as clearing the CLOSE_WAIT socket, but it is entirely impractical. I wish we had clarity on why the client is not terminating these sockets upon receiving a FIN from the server.
I'm not sure. If you find the root cause, I'm open to a change, but I don't think it's entirely related to the SDK.
Hi – I'm a teammate of @tubsandcans – I wanted to add some more context/notes in hopes this helps clarify the issue or helps anyone else with the same issue. This is roughly my understanding of the current implementation... please correct me if I'm wrong: Seahorse::Client::NetHttp::ConnectionPool#session_for appears to be responsible for building a pool of connections to S3, all of which call out to […]. This is consistent with the behavior we've observed in our most simple of tests. In a fresh Rails 8 application, we're simply opening a […]
This is perfectly fine in our small test. However, the trouble starts in our production environment, which uses Phusion Passenger and Rails' Active Storage on top of this gem. Passenger is, by default, a single-threaded, multi-process server. Every time a user downloads a file from S3, the HTTP connection dance starts and never finishes. Each CLOSE_WAIT connection occupies a Passenger process, because it believes there is work left to do. This continues until there are no Passenger processes left in the pool for users, who observe that the application simply "hangs". We've been able to work around this by doing […]
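The exact workaround is elided above; one way to apply it (an assumption, not necessarily what this team did) is to empty the SDK's connection pools at the end of every request, for example with a small Rack middleware:

```ruby
# Hypothetical Rack middleware: empty the SDK's connection pools after every
# request so a single-threaded Passenger process is not left holding a socket
# in CLOSE_WAIT.
class FlushAwsConnectionPools
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  ensure
    Aws.empty_connection_pools!
  end
end

# Hypothetical wiring, e.g. in config/application.rb:
#   config.middleware.use FlushAwsConnectionPools
```

This trades away connection re-use within a process in exchange for never leaving an idle socket behind after a request.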
Thanks for the additional context. You are correct about the behavior. Idle connections are cleared when they have expired. That flow is triggered from any new request. Have you created an issue on the passenger repo or searched around there for any clues? I think it makes sense to start there. Do you see the same behavior with a different app server such as Puma?
We do see the same behavior on Puma; it's just that the impact is much lower or non-existent because Puma is multi-threaded and therefore has the ability to re-use these sockets. I very much disagree that the issue should be addressed in Passenger and not in this gem. Again, what is the reasoning for providing no option to prevent these closed connections from hanging around in a non-threaded environment?
I'm not opposed to making some changes, maybe to allow connection pooling to be disabled. You can likely already do so with some monkey patching. I'm, however, skeptical about your setup. You mentioned you are using passenger and nginx. I have used neither, but based on what I see, passenger will use your nginx configuration. Can you share how you've configured passenger? Looking at https://www.phusionpassenger.com/library/config/nginx/reference/ I see many keys that may be of interest. I also notice that using process instead of thread for your concurrency model may not be recommended.
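Since monkey patching is mentioned, here is one rough sketch of what "disable pooling" could look like. It relies on private SDK internals (ConnectionPool#session_for and #empty!), so it is an assumption about the current implementation and could break on any SDK upgrade:

```ruby
# Hypothetical patch: after each pooled session is used, finish and drop every
# session in the pool so nothing lingers in CLOSE_WAIT. This defeats connection
# re-use entirely and relies on internals that may change without notice.
module NoHttpSessionPooling
  def session_for(endpoint, &block)
    super
  ensure
    empty!  # ConnectionPool#empty! closes all pooled sessions
  end
end

Seahorse::Client::NetHttp::ConnectionPool.prepend(NoHttpSessionPooling)
```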
We’re using process concurrency because thread-based concurrency is only available with Passenger Enterprise, not the OSS version we are using. We’re using a fairly conventional config for the type of application we’re supporting, where our min and max pool size are the same, limiting “spawning” for users. Very similar to what’s described in Passenger’s optimization guide. I’m fairly confident that an option to disable connection pooling would resolve this for anyone using this setup, which admittedly is less popular than it once was, but still fairly common amongst “legacy” Rails apps like ours.
That makes sense. I'm still puzzled as to why your process would block on that connection being idle. That doesn't make sense to me. I'm curious, what version of nginx are you using?
nginx 1.26.2. So, almost the latest of everything :)
Hm, ok. Because this only happens in your production environment with nginx/passenger, and locally there are no issues with connection pooling, I am still thinking it may be configuration. Frankly, if this were a problem with the pool (which has not changed in years) there would be a lot of noise. You are using a reverse proxy setup, correct? It may not use the standalone passenger executable, but I assume you have nginx accepting your requests and forwarding them to the app server. From some internet searches it seems that proxy_http_version defaults to 1.0 and is important to set to 1.1. There are some other configuration values to consider. I do think you should consider cross-posting this into the passenger repo even if it may end up being nginx-related guidance, since they are experts in this area.
How is the connection pool size determined? Is it based on the number of threads in the enclosing process? If so, this gem should detect when there is only one thread, skip the connection pool, and always terminate connections once transmission is complete. That would provide the best of both worlds to all users of this gem.
Here are those changes: #3197
I still would like to challenge that assumption. The connection pool is thread-safe so that any new connections checked in or out are handled correctly. That is why we synchronize around it. Even if you had one process with one thread, a connection pool still makes sense if you're making sequential calls before the timeout value. It reduces the overhead of creating a new connection each time.

I'm still skeptical of the approach you are suggesting. We could consider a configurable size for the pool, where a 0 value could effectively turn it off, but I don't think the solution has anything to do with thread counts.

I'm still also puzzled as to why your application is blocking on a half-open connection. I don't believe that would occupy an entire thread or process after transmission. Surely you are able to execute Ruby code right after. The intent of the empty pools method is for process/fork cases, which I assume passenger is effectively doing, right? Can you please share your nginx configuration and/or open a ticket with the passenger repo for discussion there? I would imagine there would be a large blowback from current customers if this were the behavior for everyone.
I filed an issue on the Passenger repo. Hopefully they can shed some light on what is happening. In the meantime, we're just going to monkey-patch the change we need in production. I'll update here if there are any developments.
Thank you. I hope you understand that changes like these are high risk. Let's get confidence in the root cause before taking action. Whether monkey patching or using the clean pools method, you are unblocked for now. |
Describe the bug
I notice that when reading (or writing) objects out of an S3 bucket using the aws-sdk-s3 (1.176.0) gem, a TCP connection is left in the CLOSE_WAIT state. This only goes away when the process running the Ruby program that initiated the connection dies.
This isn't really an issue locally, but on the production server, which is a multi-process web server (Passenger + Nginx), it appears these persistent connections slowly consume system resources to the point of complete exhaustion. I'm genuinely curious whether the decision to leave these connections in CLOSE_WAIT was intentional and just a really poor fit for our production environment, or whether it was done by mistake. Regardless, it seems at the very least imprudent to keep these connections in CLOSE_WAIT for the lifecycle of the initiating process.
I noticed this after enabling active-storage for our application, and it would slowly consume all RAM until the application crashed. Passenger is unable to re-use these web processes because they have a TCP socket in CLOSE_WAIT, so it keeps spawning more until all system resources are maxed out.
Ruby version: 3.3.6
Regression Issue
Expected Behavior
I would not expect any evidence of a TCP connection to the S3 service once transmission is completed.
Current Behavior
After reading from an S3 bucket, a TCP socket is left in CLOSE_WAIT state.
Reproduction Steps
1. Before making any S3 calls, run lsof -i -P | grep CLOSE_WAIT and notice there are no TCP connections to an AWS resource in CLOSE_WAIT.
2. Read the keys of (or an object from) an S3 bucket using the aws-sdk-s3 gem (for example, from an IRB session as sketched below), then run lsof -i -P | grep CLOSE_WAIT again and notice a TCP socket for AWS in CLOSE_WAIT.
3. Terminate the Ruby process that made the request, run lsof -i -P | grep CLOSE_WAIT once more, and notice this CLOSE_WAIT socket goes away.
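A minimal IRB session matching these steps might look like the following (region and bucket name are placeholders):

```ruby
require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')         # placeholder region
resp = s3.list_objects_v2(bucket: 'example-bucket')   # placeholder bucket
resp.contents.each { |obj| puts obj.key }

# In another terminal, while this IRB process is still running:
#   lsof -i -P | grep CLOSE_WAIT
# shows a socket to the S3 endpoint in CLOSE_WAIT; it disappears once the
# IRB process exits.
```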
Possible Solution
Close the TCP socket upon completion of transmission.
Additional Information/Context
No response
Gem name ('aws-sdk', 'aws-sdk-resources' or service gems like 'aws-sdk-s3') and its version
aws-sdk-s3 (1.176.0)
Environment details (Version of Ruby, OS environment)
Ruby 3.3.6, macOS 14.6.1