
S3 access results in improperly terminated TCP sockets. #3193

Open
tubsandcans opened this issue Feb 17, 2025 · 25 comments
Labels
investigating Issue is being investigated

Comments

@tubsandcans

Describe the bug

I've noticed that when reading (or writing) objects in an S3 bucket using the aws-sdk-s3 (1.176.0) gem, a TCP connection is left in the CLOSE_WAIT state. It only goes away when the process running the Ruby program that initiated the connection dies.

This isn't really an issue locally, but on our production server, which runs a multi-process web server (Passenger + Nginx), these lingering connections appear to slowly consume system resources to the point of complete exhaustion. I'm genuinely curious whether leaving these connections in CLOSE_WAIT was an intentional decision that just happens to be a poor fit for our production environment, or whether it was done by mistake. Either way, it seems at the very least imprudent to keep these connections in CLOSE_WAIT for the lifetime of the initiating process.

I noticed this after enabling Active Storage for our application: it would slowly consume all RAM until the application crashed. Passenger is unable to re-use web processes that have a TCP socket in CLOSE_WAIT, so it keeps spawning more until all system resources are maxed out.

Ruby version: 3.3.6

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

I would not expect any evidence of a TCP connection to the S3 service once transmission is completed.

Current Behavior

After reading from an S3 bucket, a TCP socket is left in CLOSE_WAIT state.

Reproduction Steps

  1. Run lsof -i -P | grep CLOSE_WAIT and confirm there are no TCP connections to an AWS resource in CLOSE_WAIT.
  2. Read from an S3 bucket in a Ruby console (see the sketch below).
  3. Run lsof -i -P | grep CLOSE_WAIT and notice a TCP socket for AWS in CLOSE_WAIT.
  4. Terminate the Ruby console.
  5. Run lsof -i -P | grep CLOSE_WAIT and notice the CLOSE_WAIT socket goes away.
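
A minimal IRB sketch of steps 2 and 3, for reference (the region, bucket name, and credentials setup are placeholders):

```ruby
# Minimal reproduction sketch; the bucket and region are placeholders and
# credentials are assumed to come from the environment.
require "aws-sdk-s3"

s3 = Aws::S3::Client.new(region: "us-east-1")
s3.list_objects_v2(bucket: "example-bucket").contents.each { |obj| puts obj.key }

# In another terminal, inspect sockets left open by this Ruby process:
#   lsof -i -P | grep CLOSE_WAIT
```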

Possible Solution

Close the TCP socket upon completion of transmission.

Additional Information/Context

No response

Gem name ('aws-sdk', 'aws-sdk-resources' or service gems like 'aws-sdk-s3') and its version

aws-sdk-s3 (1.176.0)

Environment details (Version of Ruby, OS environment)

Ruby 3.3.6, macOS 14.6.1

tubsandcans added the bug and needs-triage labels on Feb 17, 2025
@mullermp
Contributor

Thanks for opening an issue. Are you sure this is an issue with the Ruby SDK and not your application/host configuration? If it's not an issue locally when using the SDK, it is likely not the library code. The SDK uses Net::HTTP to make requests and we use a connection pool to re-use connections.

@tubsandcans
Author

Thanks for the quick reply! When I said it wasn't an issue locally, I just meant that there is little to no impact locally from having TCP sockets left in CLOSE_WAIT; in a multi-process production environment, it very much is a problem.

Connection pools only work within-process, for a multi-threaded model. In a multi-process paradigm, connection pools don't help.

@mullermp
Contributor

When I attempt your steps locally on my machine, I don't see the same behavior. I see one connection with lsof, then I make a call to S3 GetObject and see the same connection again in lsof.

Where in the SDK do you think this is occurring? To be frank, I don't think this is something we can change; it sounds like behavior in Net::HTTP or your configuration.

mullermp added the investigating label and removed the bug and needs-triage labels on Feb 17, 2025
@tubsandcans
Author

I'm not sure what you're referring to when you say 'your configuration'. Do you mean the S3 configuration? That is the only constant between the application in production, my local tests using IRB, and my local tests running a brand-new Rails 8 app with Active Storage. There is nothing I'm configuring in my IRB test: I'm just loading the aws-sdk-s3 gem and listing the keys in one of our S3 buckets.

Keeping TCP connections in this state is not the default behavior of Net::HTTP, so I don't know exactly where in the SDK this would be happening, but I assume the SDK is creating these sockets and is therefore responsible for closing them properly.
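
For comparison, plain Net::HTTP used with a block closes its socket when the block returns, which is the behavior I'd expect here (a minimal sketch):

```ruby
# Plain Net::HTTP: the block form calls #finish (and closes the socket) on exit.
require "net/http"

Net::HTTP.start("example.com", 443, use_ssl: true) do |http|
  http.get("/")
end
# No lingering CLOSE_WAIT socket is expected for this connection afterwards.
```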

@mullermp
Contributor

When I say configuration, it could mean anything; I don't know much about your system. It could be your host, how you've configured processes with Passenger/nginx, etc.

You can check the connection_pool code here: Seahorse::Client::NetHttp::ConnectionPool. You can try tweaking your idle timeout. Sessions are cleaned regularly based on their idle time and when they were last used.
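
For reference, the idle timeout is a client constructor option; a minimal sketch (the value here is arbitrary):

```ruby
# Sketch: a shorter idle timeout for pooled connections (the default is 5 seconds).
# Note that idle sessions are only swept when another request is made.
s3 = Aws::S3::Client.new(http_idle_timeout: 1)
```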

@tubsandcans
Author

Right, I'm saying I tested this on multiple systems in multiple ways, so it very likely does not have anything to do with a system configuration.

I'll look into the idle timeout and see if I can influence anything with that, thanks!

@tubsandcans
Author

Looks like the default value for the idle timeout is 5 seconds, but these connections stick around for as long as the process that initiated them is running. Something is preventing them from fully closing, and it is not a configuration setting.

@mullermp
Contributor

mullermp commented Feb 17, 2025

Are your calls to s3 very spiky? _clean is called when a new session is attempted. You could try calling this method:

def empty_connection_pools!
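
A one-off usage sketch (the bucket and key are placeholders):

```ruby
# Sketch: drain the SDK's connection pools after a call; this finishes every
# pooled Net::HTTP session held by the current process.
s3 = Aws::S3::Client.new
s3.get_object(bucket: "example-bucket", key: "example-key")
Aws.empty_connection_pools!
```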

@tubsandcans
Author

Spiky, as in mostly low traffic with intermittent periods of very high-volume access? In my simplest test case, I just read the keys from a bucket one time, so no traffic to speak of really.

That link is not pointing to a method, nor does the _clean method appear in the linked file.

@mullermp
Contributor

Sorry, updated. I was referring to empty_connection_pools!

@tubsandcans
Author

Thanks, that does clear the CLOSE_WAIT socket, but it's entirely impractical. I wish we had clarity on why the client is not terminating these sockets upon receiving a FIN from the server.

@mullermp
Contributor

I'm not sure. If you find the root cause, I'm open to a change, but I don't think it's entirely related to the SDK.

@eng

eng commented Feb 18, 2025

Hi, I'm a teammate of @tubsandcans. I wanted to add some more context and notes in hopes that this helps clarify the issue or helps anyone else who hits it.

This is roughly my understanding of the current implementation... please correct me if I'm wrong:

Seahorse::Client::NetHttp::ConnectionPool#session_for appears to be responsible for building a pool of connections to S3, all of which call out to http.start; however, http.finish never gets called unless an exception is raised OR the next session in the same process is started, i.e. session_for is called again. At that point, stale (e.g. CLOSE_WAIT) connections are _clean'ed, the idle timeout is evaluated, etc.
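
In pseudocode, the lifecycle we think we're seeing looks roughly like this (an illustration of the description above, not the SDK's actual code; names other than session_for/_clean are made up):

```ruby
# Illustration only: simplified shape of the pooling flow described above.
def session_for(endpoint)
  _clean                                   # finishes sessions idle past the timeout
  session = take_pooled_session(endpoint)  # reuse an existing session if present
  session ||= Net::HTTP.start(endpoint.host, endpoint.port) # no matching #finish here
  session
end
# Unless an exception is raised, #finish is only reached when a *later*
# session_for call runs _clean and decides this session is stale.
```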

This is consistent with the behavior we've observed in our simplest tests. In a fresh Rails 8 application, we simply open a rails console and read a file from a test bucket, all while watching lsof for open connections. The results:

  1. Read a file.
  2. The file finishes downloading; the connection goes into the CLOSE_WAIT state.
  3. Read another file.
  4. The previous connection gets cleaned, and a new one takes its place.
  5. Exit the rails console; all connections go away.

This is perfectly fine in our small test. However, the trouble starts in our production environment, which uses Phusion Passenger and Rails' Active Storage on top of this gem. Passenger is, by default, a single-threaded, multi-process server. Every time a user downloads a file from S3, the HTTP connection dance starts, and never finishes. Each CLOSE_WAIT connection occupies a Passenger process, because it believes there is work left to do. This continues until there are no Passenger processes left in the pool for users, who observe that the application simply "hangs".

We've been able to work around this by calling Aws.empty_connection_pools! after every request that downloads a file from S3 across our app. This executes .finish on any open HTTP connection and has been effective thus far, although it's probably not ideal. It seems like there should be a way to opt in to immediately .finish connections, as opposed to joining the pool, for those running in multi-process environments like Passenger.
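
For anyone else in the same situation, the workaround in a Rails app looks roughly like this (a sketch; the controller and action names are illustrative):

```ruby
# Sketch of the workaround described above: drain the SDK connection pools
# after any request that downloads from S3 (names below are illustrative).
class DownloadsController < ApplicationController
  after_action :drain_aws_connection_pools, only: :show

  def show
    # ... serve or redirect to the S3-backed file ...
  end

  private

  def drain_aws_connection_pools
    Aws.empty_connection_pools!
  end
end
```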

@mullermp
Contributor

Thanks for the additional context. You are correct about the behavior. Idle connections are cleared when they have expired, and that flow is triggered by any new request. Have you created an issue on the Passenger repo or searched around there for any clues? I think it makes sense to start there. Do you see the same behavior with a different app server such as Puma?

@tubsandcans
Author

We do see the same behavior on Puma; it's just that the impact is much lower or non-existent, because Puma is multi-threaded and therefore able to re-use these sockets.
Passenger is not doing anything wrong here; it's simply not able to re-use its child processes that have sockets in the CLOSE_WAIT state.

I very much disagree that the issue should be addressed in Passenger and not in this gem. Again, what is the reasoning for providing no option to prevent these closed connections from hanging around in a non-threaded environment?

@mullermp
Contributor

I'm not opposed to making some changes, maybe to allow connection pooling to be disabled. You can likely already do so with some monkey patching. I am, however, skeptical about your setup. You mentioned you are using Passenger and nginx. I have used neither, but based on what I see, Passenger will use your nginx configuration. Can you share how you've configured Passenger?

Looking at https://www.phusionpassenger.com/library/config/nginx/reference/ I see many keys that may be of interest. I also notice that using process instead of thread for your concurrency model may not be recommended.

@eng

eng commented Feb 19, 2025

We’re using process concurrency because thread-based concurrency is only available with Passenger Enterprise, not the OSS version we are using.

We’re using a fairly conventional config for the type of application we’re supporting, where our min and max pool size are the same, limiting “spawning” for users. Very similar to what’s described in Passenger’s optimization guide.

https://www.phusionpassenger.com/library/config/nginx/optimization/#tuning-the-application-process-and-thread-count
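
The relevant directives look roughly like this (a sketch with illustrative values, not our exact config):

```nginx
# Sketch of the process-count settings described above (values are illustrative).
passenger_max_pool_size 8;
passenger_min_instances 8;   # equal to the max, so the process count stays fixed
```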

I’m fairly confident that an option to disable connection pooling would resolve this for anyone using this setup, which admittedly is less popular than it once was, but still fairly common amongst “legacy” Rails apps like ours.

@mullermp
Contributor

mullermp commented Feb 19, 2025

That makes sense. I'm still puzzled as to why your process would block on that connection being idle. That doesn't make sense to me. I'm curious, what version of nginx are you using?

@eng

eng commented Feb 19, 2025

nginx 1.26.2
passenger 6.0.24
ruby 3.3.6
rails 8.0

So almost the latest of everything :)

@mullermp
Contributor

Hm, ok. Because this only happens in your production environment with nginx/Passenger, and locally there are no issues with connection pooling, I am still thinking it may be configuration. Frankly, if this were a problem with the pool (which has not changed in years), there would be a lot of noise. You are using a reverse-proxy setup, correct? It may not use the standalone Passenger executable, but I assume you have nginx accepting your requests and forwarding them to the app server. From some internet searches it seems that proxy_http_version defaults to 1.0 and is important to set to 1.1. There are some other configuration values to consider.

See here https://www.phusionpassenger.com/docs/advanced_guides/deployment_and_scaling/standalone/reverse_proxy.html
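
For reference, the directives in question look like this (a sketch; only applicable if nginx is proxying to a separate app server rather than using the integrated Passenger module):

```nginx
# Sketch: keep-alive-friendly proxying (only relevant in a reverse-proxy setup).
location / {
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_pass http://app_backend;   # upstream name is illustrative
}
```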

I do think you should consider cross posting this into the passenger repo even if it may end up being nginx related guidance, since they are experts in this area.

@tubsandcans
Author

How is the connection pool size determined? Is it based on the number of threads in the enclosing process? If so, this gem should detect when there is only one thread, skip the connection pool, and always terminate connections once transmission is complete. That would provide the best of both worlds to all users of this gem.

@tubsandcans
Author

Here are those changes: #3197

@mullermp
Contributor

I still would like to challenge that assumption. The connection pool is thread-safe so that any new connections checked in or out are handled correctly; that is why we synchronize around it. Even if you had one process with one thread, a connection pool still makes sense if you're making sequential calls before the idle timeout expires: it reduces the overhead of creating a new connection each time. I'm still skeptical of the approach you are suggesting. We could consider a configurable pool size, where a value of 0 could effectively turn it off, but I don't think the solution has anything to do with thread counts. I'm also still puzzled as to why your application is blocking on a half-open connection. I don't believe that would occupy an entire thread or process after transmission; surely you are able to execute Ruby code right after. The intent of the empty pools method is for process/fork cases, which I assume is effectively what Passenger is doing, right?

Can you please share your nginx configuration and/or open a ticket with the Passenger repo for discussion there? I would imagine there would be large blowback from current customers if this were the behavior for everyone.

@tubsandcans
Author

I filed an issue on the Passenger repo. Hopefully they can shed some light on what is happening. In the meantime, we're just going to monkey-patch the change we need in production. I'll update here if there are any developments.

@mullermp
Contributor

Thank you. I hope you understand that changes like these are high risk. Let's get confidence in the root cause before taking action. Whether monkey patching or using the clean pools method, you are unblocked for now.
