
S3 access results in improperly terminated TCP sockets. #3193

Open
tubsandcans opened this issue Feb 17, 2025 · 25 comments
Labels
investigating Issue is being investigated

Comments

@tubsandcans

Describe the bug

I've noticed that when reading (or writing) objects in an S3 bucket using the aws-sdk-s3 (1.176.0) gem, a TCP connection is left in the CLOSE_WAIT state. It only goes away when the process running the Ruby program that initiated the connection dies.

This isn't really an issue locally, but on our production server, which runs a multi-process web server (Passenger + Nginx), these lingering connections appear to slowly consume system resources to the point of complete exhaustion. I'm genuinely curious whether leaving these connections in CLOSE_WAIT was an intentional decision that just happens to be a poor fit for our production environment, or whether it was done by mistake. Either way, it seems at the very least imprudent to keep these connections in CLOSE_WAIT for the lifetime of the initiating process.

I noticed this after enabling Active Storage for our application: it would slowly consume all RAM until the application crashed. Passenger is unable to re-use web processes that have a TCP socket in CLOSE_WAIT, so it keeps spawning more until all system resources are maxed out.

Ruby version: 3.3.6

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

I would not expect any evidence of a TCP connection to the S3 service once transmission is completed.

Current Behavior

After reading from an S3 bucket, a TCP socket is left in CLOSE_WAIT state.

Reproduction Steps

  1. Run lsof -i -P | grep CLOSE_WAIT and confirm there are no TCP connections to an AWS resource in CLOSE_WAIT.
  2. Read from an S3 bucket in a Ruby console (see the sketch below).
  3. Run lsof -i -P | grep CLOSE_WAIT and notice a TCP socket for AWS in CLOSE_WAIT.
  4. Terminate the Ruby console.
  5. Run lsof -i -P | grep CLOSE_WAIT and notice the CLOSE_WAIT socket goes away.
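
A minimal IRB sketch of steps 2 and 3, for reference (the region, bucket name, and credentials setup are placeholders):

```ruby
# Minimal reproduction sketch; the bucket and region are placeholders and
# credentials are assumed to come from the environment.
require "aws-sdk-s3"

s3 = Aws::S3::Client.new(region: "us-east-1")
s3.list_objects_v2(bucket: "example-bucket").contents.each { |obj| puts obj.key }

# In another terminal, inspect sockets left open by this Ruby process:
#   lsof -i -P | grep CLOSE_WAIT
```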

Possible Solution

Close the TCP socket upon completion of transmission.

Additional Information/Context

No response

Gem name ('aws-sdk', 'aws-sdk-resources' or service gems like 'aws-sdk-s3') and its version

aws-sdk-s3 (1.176.0)

Environment details (Version of Ruby, OS environment)

Ruby 3.3.6, macOS 14.6.1

tubsandcans added the bug and needs-triage labels on Feb 17, 2025
@mullermp
Contributor

Thanks for opening an issue. Are you sure this is an issue with the Ruby SDK and not your application/host configuration? If it's not an issue locally when using the SDK, it is likely not the library code. The SDK uses Net::HTTP to make requests and we use a connection pool to re-use connections.

@tubsandcans
Author

Thanks for the quick reply! When I said it wasn't an issue locally, I just meant that there is little to no impact locally from having TCP sockets left in CLOSE_WAIT; in a multi-process production environment, it very much is a problem.

Connection pools only work within-process, for a multi-threaded model. In a multi-process paradigm, connection pools don't help.

@mullermp
Contributor

When I attempt your steps locally on my machine, I don't see the same behavior. I see one connection with lsof, then I make a call to S3 GetObject and see the same connection again in lsof.

Where in the SDK do you think this is occurring? To be frank, I don't think this is something we can change; it sounds like behavior in Net::HTTP or your configuration.

mullermp added the investigating label and removed the bug and needs-triage labels on Feb 17, 2025
@tubsandcans
Author

I'm not sure what you're referring to when you say 'your configuration'. Do you mean the S3 configuration? That is the only constant between the application in production, my local tests using IRB, and my local tests running a brand-new Rails 8 app with Active Storage. There is nothing I'm configuring in my IRB test: I'm just loading the aws-sdk-s3 gem and listing the keys in one of our S3 buckets.

Keeping TCP connections in this state is not the default behavior of Net::HTTP, so I don't know exactly where in the SDK this would be happening, but I assume the SDK is creating these sockets and is therefore responsible for closing them properly.
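
For comparison, plain Net::HTTP used with a block closes its socket when the block returns, which is the behavior I'd expect here (a minimal sketch):

```ruby
# Plain Net::HTTP: the block form calls #finish (and closes the socket) on exit.
require "net/http"

Net::HTTP.start("example.com", 443, use_ssl: true) do |http|
  http.get("/")
end
# No lingering CLOSE_WAIT socket is expected for this connection afterwards.
```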

@mullermp
Contributor

When I say configuration, it could mean anything; I don't know much about your system. It could be your host, how you've configured processes with Passenger/nginx, etc.

You can check the connection_pool code here: Seahorse::Client::NetHttp::ConnectionPool. You can try tweaking your idle timeout. Sessions are cleaned regularly based on their idle time and when they were last used.
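
For reference, the idle timeout is a client constructor option; a minimal sketch (the value here is arbitrary):

```ruby
# Sketch: a shorter idle timeout for pooled connections (the default is 5 seconds).
# Note that idle sessions are only swept when another request is made.
s3 = Aws::S3::Client.new(http_idle_timeout: 1)
```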

@tubsandcans
Author

Right, I'm saying I tested this on multiple systems in multiple ways, so it very likely does not have anything to do with a system configuration.

I'll look into the idle timeout and see if I can influence anything with that, thanks!

@tubsandcans
Author

Looks like the default value for the idle timeout is 5 seconds, but these connections stick around for as long as the process that initiated them is running. Something is preventing them from fully closing, and it is not a configuration setting.

@mullermp
Contributor

mullermp commented Feb 17, 2025

Are your calls to s3 very spiky? _clean is called when a new session is attempted. You could try calling this method:

def empty_connection_pools!
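
A one-off usage sketch (the bucket and key are placeholders):

```ruby
# Sketch: drain the SDK's connection pools after a call; this finishes every
# pooled Net::HTTP session held by the current process.
s3 = Aws::S3::Client.new
s3.get_object(bucket: "example-bucket", key: "example-key")
Aws.empty_connection_pools!
```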

@tubsandcans
Author

Spiky, as in mostly low traffic with intermittent periods of very high-volume access? In my simplest test case, I just read the keys from a bucket one time, so no traffic to speak of really.

That link is not pointing to a method, nor does the _clean method appear in the linked file.

@mullermp
Contributor

Sorry, updated. I was referring to empty_connection_pools!

@tubsandcans
Author

Thanks, that does clear the CLOSE_WAIT socket, but it's entirely impractical. I wish we had clarity on why the client is not terminating these sockets upon receiving a FIN from the server.

@mullermp
Contributor

I'm not sure. If you find the root cause, I'm open to a change, but I don't think it's entirely related to the SDK.

@eng

eng commented Feb 18, 2025

Hi, I'm a teammate of @tubsandcans. I wanted to add some more context and notes in hopes that this helps clarify the issue or helps anyone else who hits it.

This is roughly my understanding of the current implementation... please correct me if I'm wrong:

Seahorse::Client::NetHttp::ConnectionPool#session_for appears to be responsible for building a pool of connections to S3, all of which call out to http.start; however, http.finish never gets called unless an exception is raised OR the next session in the same process is started, i.e. session_for is called again. At that point, stale (e.g. CLOSE_WAIT) connections are _clean'ed, the idle timeout is evaluated, etc.
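
In pseudocode, the lifecycle we think we're seeing looks roughly like this (an illustration of the description above, not the SDK's actual code; names other than session_for/_clean are made up):

```ruby
# Illustration only: simplified shape of the pooling flow described above.
def session_for(endpoint)
  _clean                                   # finishes sessions idle past the timeout
  session = take_pooled_session(endpoint)  # reuse an existing session if present
  session ||= Net::HTTP.start(endpoint.host, endpoint.port) # no matching #finish here
  session
end
# Unless an exception is raised, #finish is only reached when a *later*
# session_for call runs _clean and decides this session is stale.
```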

This is consistent with the behavior we've observed in our simplest tests. In a fresh Rails 8 application, we simply open a rails console and read a file from a test bucket, all while watching lsof for open connections. The results:

  1. Read a file.
  2. The file finishes downloading; the connection goes into the CLOSE_WAIT state.
  3. Read another file.
  4. The previous connection gets cleaned, and a new one takes its place.
  5. Exit the rails console; all connections go away.

This is perfectly fine in our small test. However, the trouble starts in our production environment, which uses Phusion Passenger and Rails' Active Storage on top of this gem. Passenger is, by default, a single-threaded, multi-process server. Every time a user downloads a file from S3, the HTTP connection dance starts, and never finishes. Each CLOSE_WAIT connection occupies a Passenger process, because it believes there is work left to do. This continues until there are no Passenger processes left in the pool for users, who observe that the application simply "hangs".

We've been able to work around this by calling Aws.empty_connection_pools! after every request that downloads a file from S3 across our app. This executes .finish on any open HTTP connection and has been effective thus far, although it's probably not ideal. It seems like there should be a way to opt in to immediately .finish connections, as opposed to joining the pool, for those running in multi-process environments like Passenger.
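
For anyone else in the same situation, the workaround in a Rails app looks roughly like this (a sketch; the controller and action names are illustrative):

```ruby
# Sketch of the workaround described above: drain the SDK connection pools
# after any request that downloads from S3 (names below are illustrative).
class DownloadsController < ApplicationController
  after_action :drain_aws_connection_pools, only: :show

  def show
    # ... serve or redirect to the S3-backed file ...
  end

  private

  def drain_aws_connection_pools
    Aws.empty_connection_pools!
  end
end
```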

@mullermp
Contributor

Thanks for the additional context. You are correct about the behavior. Idle connections are cleared when they have expired, and that flow is triggered by any new request. Have you created an issue on the Passenger repo or searched around there for any clues? I think it makes sense to start there. Do you see the same behavior with a different app server such as Puma?

@tubsandcans
Author

We do see the same behavior on Puma; it's just that the impact is much lower or non-existent, because Puma is multi-threaded and therefore able to re-use these sockets.
Passenger is not doing anything wrong here; it's simply not able to re-use its child processes that have sockets in the CLOSE_WAIT state.

I very much disagree that the issue should be addressed in Passenger and not in this gem. Again, what is the reasoning for providing no option to prevent these closed connections from hanging around in a non-threaded environment?

@mullermp
Contributor

I'm not opposed to making some changes, maybe to allow connection pooling to be disabled. You can likely already do so with some monkey patching. I am, however, skeptical about your setup. You mentioned you are using Passenger and nginx. I have used neither, but based on what I see, Passenger will use your nginx configuration. Can you share how you've configured Passenger?

Looking at https://www.phusionpassenger.com/library/config/nginx/reference/ I see many keys that may be of interest. I also notice that using process instead of thread for your concurrency model may not be recommended.

@eng

eng commented Feb 19, 2025

We’re using process concurrency because thread-based concurrency is only available with Passenger Enterprise, not the OSS version we are using.

We’re using a fairly conventional config for the type of application we’re supporting, where our min and max pool size are the same, limiting “spawning” for users. Very similar to what’s described in Passenger’s optimization guide.

https://www.phusionpassenger.com/library/config/nginx/optimization/#tuning-the-application-process-and-thread-count
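
The relevant directives look roughly like this (a sketch with illustrative values, not our exact config):

```nginx
# Sketch of the process-count settings described above (values are illustrative).
passenger_max_pool_size 8;
passenger_min_instances 8;   # equal to the max, so the process count stays fixed
```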

I’m fairly confident that an option to disable connection pooling would resolve this for anyone using this setup, which admittedly is less popular than it once was, but still fairly common amongst “legacy” Rails apps like ours.

@mullermp
Contributor

mullermp commented Feb 19, 2025

That makes sense. I'm still puzzled as to why your process would block on that connection being idle. That doesn't make sense to me. I'm curious, what version of nginx are you using?

@eng

eng commented Feb 19, 2025

nginx 1.26.2
passenger 6.0.24
ruby 3.3.6
rails 8.0

So almost the latest of everything :)

@mullermp
Contributor

Hm, ok. Because this only happens in your production environment with nginx/Passenger, and locally there are no issues with connection pooling, I am still thinking it may be configuration. Frankly, if this were a problem with the pool (which has not changed in years), there would be a lot of noise. You are using a reverse-proxy setup, correct? It may not use the standalone Passenger executable, but I assume you have nginx accepting your requests and forwarding them to the app server. From some internet searches it seems that proxy_http_version defaults to 1.0 and is important to set to 1.1. There are some other configuration values to consider.

See here https://www.phusionpassenger.com/docs/advanced_guides/deployment_and_scaling/standalone/reverse_proxy.html
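
For reference, the directives in question look like this (a sketch; only applicable if nginx is proxying to a separate app server rather than using the integrated Passenger module):

```nginx
# Sketch: keep-alive-friendly proxying (only relevant in a reverse-proxy setup).
location / {
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_pass http://app_backend;   # upstream name is illustrative
}
```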

I do think you should consider cross posting this into the passenger repo even if it may end up being nginx related guidance, since they are experts in this area.

@tubsandcans
Author

How is the connection pool size determined? Is it based on the number of threads in the enclosing process? If so, this gem should detect when there is only one thread, skip the connection pool, and always terminate connections once transmission is complete. That would provide the best of both worlds to all users of this gem.

@tubsandcans
Author

Here are those changes: #3197

@mullermp
Contributor

I still would like to challenge that assumption. The connection pool is thread-safe so that any new connections checked in or out are handled correctly; that is why we synchronize around it. Even if you had one process with one thread, a connection pool still makes sense if you're making sequential calls before the idle timeout expires: it reduces the overhead of creating a new connection each time. I'm still skeptical of the approach you are suggesting. We could consider a configurable pool size, where a value of 0 could effectively turn it off, but I don't think the solution has anything to do with thread counts. I'm also still puzzled as to why your application is blocking on a half-open connection. I don't believe that would occupy an entire thread or process after transmission; surely you are able to execute Ruby code right after. The intent of the empty pools method is for process/fork cases, which I assume is effectively what Passenger is doing, right?

Can you please share your nginx configuration and/or open a ticket with the Passenger repo for discussion there? I would imagine there would be large blowback from current customers if this were the behavior for everyone.

@tubsandcans
Author

I filed an issue on the Passenger repo. Hopefully they can shed some light on what is happening. In the meantime, we're just going to monkey-patch the change we need in production. I'll update here if there are any developments.

@mullermp
Contributor

Thank you. I hope you understand that changes like these are high risk. Let's get confidence in the root cause before taking action. Whether monkey patching or using the clean pools method, you are unblocked for now.
