
Aws::CloudWatch::Client doesn't automatically refresh credentials #3162

Open

Roy-Gal-Git opened this issue Jan 7, 2025 · 36 comments

@Roy-Gal-Git

Roy-Gal-Git commented Jan 7, 2025

Describe the bug

We use the sidekiq-cloudwatchmetrics gem to send Sidekiq-related metrics to CloudWatch. Occasionally, we encounter the following error:

Aws::CloudWatch::Errors::ExpiredToken: The security token included in the request is expired

Upon investigation, we found that the gem initializes an Aws::CloudWatch::Client, which uses Aws::InstanceProfileCredentials. These credentials are supposed to auto-refresh before expiration. However, the error indicates that the token is not being refreshed in time, causing the metrics to fail. Once the error starts, it repeats, and the metrics fail to recover.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

The Aws::InstanceProfileCredentials should automatically refresh the security token before expiration (approximately 5 minutes before expiry) and ensure that the put_metric_data API call succeeds without errors. Metrics should continue to be sent to CloudWatch without interruption.
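
For illustration, here is a minimal console sketch (assumptions: run on the instance itself, region is a placeholder) of how we inspect the resolved provider and how much lifetime its credentials have left:

require 'aws-sdk-cloudwatch'

client   = Aws::CloudWatch::Client.new(region: 'us-east-1') # region placeholder
provider = client.config.credentials

puts provider.class                 # expected: Aws::InstanceProfileCredentials
puts provider.expiration            # expiry reported by IMDS
puts provider.expiration - Time.now # seconds left; auto-refresh should kick in ~5 minutes before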

Current Behavior

The Aws::CloudWatch::Client raises Aws::CloudWatch::Errors::ExpiredToken errors when attempting to send metrics.

Once this error occurs:

  1. The metrics stop being sent to CloudWatch.
  2. The error persists until the process is restarted or manually handled.
  3. The auto-refresh mechanism in Aws::InstanceProfileCredentials does not appear to trigger correctly in this context.

Reproduction Steps

  1. Configure the sidekiq-cloudwatchmetrics gem in a Sidekiq setup that uses an IAM role (e.g., on EC2 or ECS). You can find the configuration I used here:
  2. Allow the process to run for an extended period, relying on auto-refreshable credentials from Aws::InstanceProfileCredentials.
  3. Observe that after some time, the following error may appear:
     Aws::CloudWatch::Errors::ExpiredToken: The security token included in the request is expired

Possible Solution

No response

Additional Information/Context

Here are some useful links for more context regarding my problem:

Gem name ('aws-sdk', 'aws-sdk-resources' or service gems like 'aws-sdk-s3') and its version

aws-sdk-cloudwatch

Environment details (Version of Ruby, OS environment)

Ruby - 3.2.5, OS - Debian GNU/Linux 12 (bookworm), Docker image - ruby:3.2.5-slim, sidekiq-cloudwatchmetrics - 2.6.0

@Roy-Gal-Git Roy-Gal-Git added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 7, 2025
@alextwoods
Contributor

You are correct that the InstanceProfileCredentials should be auto-refreshing (if they are within 5 minutes of expiration).

We added "static stability" logic a while back that allows expired credentials to be used after their expiration when the underlying credential service isn't refreshing credentials on the host (e.g., during an outage). Are you seeing any logs like Attempting credential expiration extension due to a credential service availability issue?

Have you tested your PR that does the explicit credentials refresh - does that fix the issue for you?

@alextwoods alextwoods added investigating Issue is being investigated and removed needs-triage This issue or PR still needs to be triaged. labels Jan 7, 2025
@Roy-Gal-Git
Author

Hi @alextwoods,

I’m unable to reproduce the error, so I can’t verify whether the issue has been resolved. Even if my PR makes it to production, I won’t be able to confirm if it fixes the issue for me, as the error occurs randomly across different production environments.

Are you seeing any logs like Attempting credential expiration extension due to a credential service availability issue?

No, I didn't find any logs like the one you suggested.
I also searched for anything containing the words 'credential' and 'attempting' (besides the exact match of the full message you sent), but didn't find anything either.

@mullermp
Contributor

mullermp commented Jan 7, 2025

This may be a naive question, but are you certain that the other gem is resolving instance profile credentials and that credentials are not set globally or in some other way? Reading your other thread, it looks like you tried creating a client on your own and the credentials defaulted to instance profile credentials (which is last in the chain), but it's possible that in your production app they are set to something else. I would actually output an instance of your credentials provider after deploying, somehow.

@Roy-Gal-Git
Author

Hi @mullermp,

are you certain that the other gem is resolving instance profile credentials, and credentials are not set globally or in some other way?

The other gem's CloudWatch publisher requires a client argument, and users can determine how this client is configured. In my case, I explicitly pass Aws::CloudWatch::Client.new(region: REGION_PLACEHOLDER), which resolves to instance profile credentials in the production environment.

I would actually output an instance of your credentials provider after deploying somehow

I shared this in the other thread as well, but here’s the truncated output of @client.config (where @client is the CloudWatch client I pass to the gem’s publisher) from a production environment:

#<struct
	credentials=
		#<Aws::InstanceProfileCredentials:0x00007f78ee260660
			...
			@credentials=#<Aws::Credentials access_key_id="PLACEHOLDER_ACCESS_KEY_ID">,
			...>
	...
	secret_access_key=nil
	session_token=nil
	...>

This confirms that the client resolves to Aws::InstanceProfileCredentials, as expected in this setup.

@mullermp
Contributor

mullermp commented Jan 8, 2025

Yes I saw that. Try passing an instance of instance profile credentials explicitly to the client in your production code. Looking at the gem source, I don't think any patch is necessary. The SDK will already refresh and retry if necessary.

@Roy-Gal-Git
Author

Roy-Gal-Git commented Jan 8, 2025

@mullermp How come it doesn't already refresh and retry if it does use an instance of instance profile credentials?

@mullermp
Contributor

mullermp commented Jan 8, 2025

It does and should. You can verify that by fetching instance credentials from irb or pry on your host and inspecting them after accessing them near expiration. How did you verify the credentials class in production? Using a rails console or print statement? Or through local testing? By passing in credentials explicitly, we can be sure that is what is used. What version of aws-sdk-cloudwatch and aws-sdk-core are you using?
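
For example, something along these lines from irb or pry on the host (a sketch, not exact commands):

require 'aws-sdk-core'

creds   = Aws::InstanceProfileCredentials.new
old_key = creds.credentials.access_key_id
old_exp = creds.expiration

creds.refresh!                               # force a refresh against IMDS
puts creds.credentials.access_key_id == old_key ? 'same key returned' : 'new key issued'
puts "expiration: #{old_exp} -> #{creds.expiration}"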

@Roy-Gal-Git
Author

How did you verify the credentials class in production?

Rails console. I initialized an Aws::CloudWatch::Client object the way I do for the sidekiq-cloudwatchmetrics publisher and performed the checks on that object. You're saying the client credentials could differ when tested from the Rails console?

What version of aws-sdk-cloudwatch and aws-sdk-core are you using?

aws-sdk-cloudwatch (1.108.0)
aws-sdk-core (3.214.0)

@mullermp
Contributor

mullermp commented Jan 8, 2025

They shouldn't differ. Trust but verify: did you pass the Rails env as production in your console? Are you setting global configuration anywhere, such as Aws.config[key]?

Otherwise I'm not sure what may be going on. I'm inclined to believe it's something with your environment because if those credentials don't refresh then there would be a lot more noise and our internal usage would stop working.

@Roy-Gal-Git
Author

did you pass rails env as production in your console?

We use a setup where the Rails environment name matches the server name (e.g., us, eu). The correct Rails env is automatically passed to the Rails console as an exported environment variable.

Are you setting global configuration anywhere such as Aws.config[key]?

No, we are not setting Aws.config[key] anywhere in our application.

if those credentials don't refresh then there would be a lot more noise and our internal usage would stop working.

Interestingly, the only component affected is the sidekiq-cloudwatchmetrics publisher. The app continues to function as expected, but metrics stop being sent to CloudWatch, which triggers an alarm after a short while.

@alextwoods
Contributor

The app continues to function as expected, but metrics stop being sent to CloudWatch, which triggers an alarm after a short while.

Do you use other AWS services in your application? If so, it's interesting that you would get expired token issues only in CloudWatch.

Do you know if you are using IMDS V1 or V2? And do you know if you've disabled V1 through configuration (for example, the AWS_EC2_METADATA_V1_DISABLED env variable)? There are some differences in the refresh process for credentials from V1 and V2.

@Roy-Gal-Git
Author

@alextwoods

Do you use other AWS services in your application?

Yes, S3, SSM, STS...

Do you know if you are using IMDS V1 or V2?

We're using IMDS V2.

And do you know if you've disabled V1 through configuration

Yes, IMDS V1 is disabled.

@Roy-Gal-Git
Author

I believe the backtrace of the error could provide valuable insights for the investigation:

/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/raise_response_errors.rb:17 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/checksum_algorithm.rb:111 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/jsonvalue_converter.rb:16 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/invocation_id.rb:16 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/idempotency_token.rb:19 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/param_converter.rb:26 in call
/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/request_callback.rb:89 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/response_paging.rb:12 in call
/gems/aws-sdk-core-3.214.0/lib/seahorse/client/plugins/response_target.rb:24 in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/telemetry.rb:39 in block in call
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/telemetry/no_op.rb:29 in in_span
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/telemetry.rb:53 in span_wrapper
/gems/aws-sdk-core-3.214.0/lib/aws-sdk-core/plugins/telemetry.rb:39 in call
/gems/aws-sdk-core-3.214.0/lib/seahorse/client/request.rb:72 in send_request
/gems/aws-sdk-cloudwatch-1.108.0/lib/aws-sdk-cloudwatch/client.rb:3865 in put_metric_data
/gems/sidekiq-cloudwatchmetrics-2.6.0/lib/sidekiq/cloudwatchmetrics.rb:235 in block in publish
/gems/sidekiq-cloudwatchmetrics-2.6.0/lib/sidekiq/cloudwatchmetrics.rb:234 in each
/gems/sidekiq-cloudwatchmetrics-2.6.0/lib/sidekiq/cloudwatchmetrics.rb:234 in each_slice
/gems/sidekiq-cloudwatchmetrics-2.6.0/lib/sidekiq/cloudwatchmetrics.rb:234 in publish
/gems/sidekiq-cloudwatchmetrics-2.6.0/lib/sidekiq/cloudwatchmetrics.rb:77 in run
/gems/sidekiq-6.5.12/lib/sidekiq/component.rb:8 in watchdog
/gems/sidekiq-6.5.12/lib/sidekiq/component.rb:17 in block in safe_thread

@mullermp
Contributor

I don't think that backtrace is helpful. One thing you can try is to add http_wire_trace: true to your client to see wire logs, and perhaps even add http_debug_output to an instance of the instance profile credentials. This might give clues about what requests are being made. Note that these options do not redact your credentials, so be careful with logging and sharing.
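
Roughly like this (a sketch; the region and log destinations are placeholders):

require 'logger'
require 'aws-sdk-cloudwatch'

creds = Aws::InstanceProfileCredentials.new(
  http_debug_output: $stderr          # raw IMDS credential requests/responses
)
client = Aws::CloudWatch::Client.new(
  region: 'us-east-1',
  credentials: creds,
  http_wire_trace: true,              # raw CloudWatch API requests/responses
  logger: Logger.new($stdout)
)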

@alextwoods
Contributor

Yeah, if possible, adding http_debug_output to an instance of the instance profile credentials would be the most useful, as otherwise there's no logging from the credential refresh. My best guess is that something unusual is going on with the refresh (maybe an error from the instance that's getting ignored?).

If you can't enable http_debug_output on an instance of the instance profile credentials that you're using to configure your CloudWatch client because of logging concerns, you could at least provide a before_refresh proc:

before_refresh = proc do |creds| # called with the credentials provider (self)
  # do some logging, e.g.:
  Rails.logger.info("[#{Time.now}] refreshing #{creds.class}")
end
credentials = Aws::InstanceProfileCredentials.new(before_refresh: before_refresh)
client = Aws::CloudWatch::Client.new(credentials: credentials)

@Roy-Gal-Git
Author

Are the http_debug_output logs outputted using the default rails logger? If so, is there some sort of way for me to output those logs to a specific file?

@alextwoods
Contributor

You can pass an instance of a logger (or technically anything that responds to <<) as the http_debug_output, e.g.:

credentials = Aws::InstanceProfileCredentials.new(http_debug_output: Rails.logger)

But again as a general warning - this will log the actual requests and responses which include credentials.
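
If you'd rather keep these logs out of the Rails log, a dedicated Logger pointed at a file works too (a sketch; the path is just an example):

require 'logger'

imds_log = Logger.new('/var/log/app/imds_debug.log') # example path
credentials = Aws::InstanceProfileCredentials.new(http_debug_output: imds_log)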

@Roy-Gal-Git
Author

Hey @alextwoods
I had to check with our management team, and sadly I can't do anything involving http_debug_output.

It seems like the current direction for resolving this issue is using the forked version of sidekiq-cloudwatchmetrics that contains my fix.

If I find that it does solve the issue, would that prove the need for a fix on your side?

Is there something else I can do to provide extra information?

@mullermp
Contributor

I think we would want proof and root cause before taking a patch. Your fix in the other gem is a brute force solution which just retries after refreshing. I am inclined to believe it is something with your setup or environment.

@alextwoods
Contributor

@Roy-Gal-Git - yeah, that makes sense. I'd recommend in that case at least adding a before_refresh proc that logs when refreshes are happening (see above: #3162 (comment)).

I do think it will be useful to know whether your sidekiq patch does address the issue.

An additional possible cause: time drift on your instance. If the system clock is off, the refresh of credentials on the instance itself will be incorrect. Are you running anything like chrony, the Amazon Time Sync Service, or some other NTP setup?
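
If it helps, here's a rough way to sanity-check the clock from Ruby (just a sketch; the endpoint choice is arbitrary): compare the local time against the Date header of any AWS HTTPS response.

require 'net/http'
require 'time'

res         = Net::HTTP.get_response(URI('https://sts.us-east-1.amazonaws.com/'))
server_time = Time.httpdate(res['date'])
skew        = (Time.now.utc - server_time).abs
puts format('approximate clock skew: %.1f seconds', skew)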

@mullermp mullermp added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed investigating Issue is being investigated bug This issue is a bug. labels Jan 21, 2025
@Roy-Gal-Git
Author

Roy-Gal-Git commented Jan 23, 2025

@alextwoods

I do think it will be useful to know whether you're sidekiq patch does address the issue.

I see that the current label is response-requested, but it'll probably take more than a week for me to gain new insights from our environments...

Are you running anything like chrony, Amazon time Sync service or some other NTP setup?

No, we don't use those.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 24, 2025
@Roy-Gal-Git
Author

@mullermp @alextwoods I have 2 small updates:

  1. We found 2 more recurring errors that might share the same root cause:
  • Aws::Errors::MissingCredentialsError: unable to sign request without credentials set
    Backtrace:
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/sign.rb:119 in rescue in initialize
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/sign.rb:108 in initialize
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/sign.rb:33 in new
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/sign.rb:33 in signer_for
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/sign.rb:45 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/transfer_encoding.rb:27 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/helpful_socket_errors.rb:12 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/s3_signer.rb:53 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/redirects.rb:20 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/user_agent.rb:69 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/retry_errors.rb:365 in block in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/user_agent.rb:60 in metric
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/retry_errors.rb:385 in with_metric
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/retry_errors.rb:365 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/md5s.rb:32 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/http_checksum.rb:20 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/endpoint_pattern.rb:30 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/checksum_algorithm.rb:137 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/request_compression.rb:94 in block in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/request_compression.rb:104 in with_metric
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/request_compression.rb:94 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/rest/content_type_handler.rb:27 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/express_session_auth.rb:50 in block in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/express_session_auth.rb:56 in with_metric
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/express_session_auth.rb:50 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/expect_100_continue.rb:23 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/bucket_name_restrictions.rb:21 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/rest/handler.rb:10 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/recursion_detection.rb:18 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/endpoints.rb:52 in block in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/user_agent.rb:60 in metric
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/endpoints.rb:66 in with_metrics
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/endpoints.rb:52 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/endpoint_discovery.rb:84 in call
/gems/aws-sdk-core-3.212.0/lib/seahorse/client/plugins/endpoint.rb:46 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/param_validator.rb:26 in call
/gems/aws-sdk-core-3.212.0/lib/seahorse/client/plugins/raise_response_errors.rb:16 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/sse_cpk.rb:24 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/dualstack.rb:21 in call
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/plugins/accelerate.rb:43 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/checksum_algorithm.rb:111 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/jsonvalue_converter.rb:16 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/invocation_id.rb:16 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/idempotency_token.rb:19 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/param_converter.rb:26 in call
/gems/aws-sdk-core-3.212.0/lib/seahorse/client/plugins/request_callback.rb:89 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/response_paging.rb:12 in call
/gems/aws-sdk-core-3.212.0/lib/seahorse/client/plugins/response_target.rb:24 in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/telemetry.rb:39 in block in call
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/telemetry/no_op.rb:29 in in_span
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/telemetry.rb:53 in span_wrapper
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/telemetry.rb:39 in call
/gems/aws-sdk-core-3.212.0/lib/seahorse/client/request.rb:72 in send_request
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/client.rb:16571 in put_object
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:66 in block in put_object
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:55 in block in open_file
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:55 in open
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:55 in open_file
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:65 in put_object
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:46 in block in upload
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/user_agent.rb:60 in metric
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/file_uploader.rb:40 in upload
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/customizations/object.rb:477 in block in upload_file
/gems/aws-sdk-core-3.212.0/lib/aws-sdk-core/plugins/user_agent.rb:60 in metric
/gems/aws-sdk-s3-1.169.0/lib/aws-sdk-s3/customizations/object.rb:476 in upload_file

And

  • Aws::Sigv4::Errors::MissingCredentialsError: missing credentials, provide credentials with one of the following options: - :access_key_id and :secret_access_key - :credentials - :credentials_provider
    Backtrace:
/gems/aws-sigv4-1.9.1/lib/aws-sigv4/signer.rb:717 in extract_credentials_provider
/gems/aws-sigv4-1.9.1/lib/aws-sigv4/signer.rb:147 in initialize
/gems/aws-sdk-s3-1.113.2/lib/aws-sdk-s3/plugins/s3_signer.rb:235 in new
/gems/aws-sdk-s3-1.113.2/lib/aws-sdk-s3/plugins/s3_signer.rb:235 in build_v4_signer
/gems/aws-sdk-s3-1.113.2/lib/aws-sdk-s3/plugins/s3_signer.rb:14 in block in <class:S3Signer>
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:72 in call
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:215 in block in resolve_defaults
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:59 in each
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:59 in each
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:214 in resolve_defaults
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:207 in value_at
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:191 in block in resolve
/usr/local/lib/ruby/3.2.0/set.rb:511 in each_key
/usr/local/lib/ruby/3.2.0/set.rb:511 in each
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:191 in resolve
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:179 in apply_defaults
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/configuration.rb:152 in build!
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/base.rb:65 in build_config
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/base.rb:21 in initialize
/gems/aws-sdk-s3-1.113.2/lib/aws-sdk-s3/client.rb:435 in initialize
/gems/aws-sdk-core-3.204.0/lib/seahorse/client/base.rb:102 in new
/gems/aws-sdk-s3-1.113.2/lib/aws-sdk-s3/resource.rb:28 in initialize

Do you think that the 3 errors are related in any kind of way?

  2. My sidekiq-cloudwatchmetrics patch has been running in one of our environments for 2 weeks now. The (original) error hasn't occurred yet. I don't think that's enough time to conclude that the fix works, but it looks like a promising direction.

@alextwoods
Contributor

With the test you are doing with the patch, are you also running the before_refresh logging?

If the patch does work, we'll want to understand if the SDK is trying to refresh credentials already.

Those errors could be related: in general, both are caused by some environment setup issue that results in the SDK being unable to find valid credentials during client initialization. Assuming these errors are happening in code running on EC2/ECS, it would indicate a potential issue with IMDS not having valid credentials. This could happen, for example, if there is a clock skew issue on the host or something else is misconfigured.
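
One way to see what IMDS itself is serving (a sketch, assuming it runs on the EC2 host; these are the standard IMDS credential paths) is to query it directly. Aws::EC2Metadata handles the IMDSv2 token exchange for you:

require 'aws-sdk-core'
require 'json'

imds = Aws::EC2Metadata.new
role = imds.get('/latest/meta-data/iam/security-credentials/').strip
doc  = JSON.parse(imds.get("/latest/meta-data/iam/security-credentials/#{role}"))
puts doc.values_at('Code', 'Expiration').inspect # expect ["Success", <future timestamp>]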

@Roy-Gal-Git
Author

@alextwoods

With the test for you are doing with the patch, are you also running with the before_refresh logging?

No, it's just the patch.

Assuming these errors are happening from code running on EC2/ECS it would indicate a potential issue with IMDS not having valid credentials.

It runs on EC2 machines that are managed by an Elastic Beanstalk environment.

@mullermp
Contributor

mullermp commented Feb 6, 2025

Your previous error involved expired credentials (a valid Ruby object with invalid credentials), but this new error indicates your entire credentials object has been blown away. Here is where that is raised: https://github.com/aws/aws-sdk-ruby/blob/version-3/gems%2Faws-sdk-core%2Flib%2Faws-sdk-core%2Fplugins%2Fsign.rb. I do think it's still something with your setup or environment; no SDK code that I'm aware of will remove credentials from configuration.

@Roy-Gal-Git
Author

@mullermp Interesting!

Is it possible that initializing Aws::S3::Resource objects with access_key_id and secret_access_key set to nil could be causing this issue?

Aws::S3::Resource.new(region: 'REGION_PLACEHOLDER', access_key_id: nil, secret_access_key: nil)

I also noticed that sometimes we pass a hardcoded region, while other times we use a memoized instance of Aws::EC2Metadata to fetch the EC2 instance's region from the metadata API. Could this discrepancy contribute to the credential drifts you mentioned?

For context, we always use the latest AMI for our Elastic Beanstalk environments and don’t make any manual changes to it. In practice, we run everything with Docker on ECS. Could something in this setup be affecting credential persistence?

@mullermp
Contributor

mullermp commented Feb 9, 2025

Why are you setting a fake region and nil credentials in your client/resource?

@Roy-Gal-Git
Author

@mullermp I just changed the region to the placeholder. It's 'us-east-1' if it matters.
The credentials are nil so they'll be fetched from the default credential provider chain.

@Roy-Gal-Git
Author

@alextwoods @mullermp

An additional possible cause - time drift on your instance - if the system clock is off, the refresh of credentials on in the instance itself will be incorrect. Are you running anything like chrony, Amazon time Sync service or some other NTP setup?

I was wrong when I said we don't use those services.

The instance is using Amazon Time Sync Service (169.254.169.123) with Chrony. The system time is accurate, with an offset of only a few nanoseconds, and chronyc tracking confirms synchronization is stable. Time drift is not an issue.

@mullermp
Contributor

You should not need to set any keys to nil at all for the credentials provider chain to work. They can be omitted. At some point, your credentials are being blown away. I'm not sure why, but perhaps usage in your code base that we cannot see.
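
In other words, something like this (a sketch; region taken from your earlier comment):

require 'aws-sdk-s3'

# Omit the credential keys entirely; the default provider chain will resolve
# instance profile credentials on its own.
s3 = Aws::S3::Resource.new(region: 'us-east-1')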

@Roy-Gal-Git
Author

So is there anything else I can provide for you to be able to investigate this further?

@mullermp
Contributor

I can't really think of anything else, I'm sorry. The code in the cloudwatch metrics sidekiq gem looks fine - it just takes a client instance parameter and calls an operation. Are you always supplying that parameter with your created client? You should only need to create the client once, perhaps as a singleton early in your app. Confirm it is actually resolving instance profile credentials, and confirm you can call refresh! on it and that it produces valid credentials. If you had an instance of the credentials object, you could even call refresh! yourself somewhere to force a refresh. Do you have a multithreaded application? Do you modify your client anywhere in your code?

@Roy-Gal-Git
Author

Confirm it is actually resolving instance profile credentials and confirm you can call refresh! on it and it produces valid credentials.

I already confirmed both, which is why I implemented this patch for the sidekiq-cloudwatchmetrics gem.

Are you always supplying that parameter of your created client? You should only need to create the client once, perhaps as a singleton early in your app.

Yes, we always supply the client parameter. The AWS CloudWatch client is created once in our Sidekiq initializer and reused within the Sidekiq process.

aws_region = AWSable.get_region # Memoized `Aws::EC2Metadata` object fetches region
Sidekiq::CloudWatchMetrics.enable!(
  client: Aws::CloudWatch::Client.new(region: aws_region),
  namespace: "FILTERED" # Placeholder value
)

Confirm it is actually resolving instance profile credentials and confirm you can call refresh! on it and it produces valid credentials.

Same here. Since deploying the patched version, we have had 0 occurrences of:
Aws::CloudWatch::Errors::ExpiredToken: The security token included in the request is expired

Do you have a multithreaded application?

Our application does not explicitly use multithreading. However, some gems may internally use background threads.

Do you modify your client anywhere in your code?

No, we pass it directly in the Sidekiq initializer (see the code block above).

@mullermp
Contributor

Can you try creating an instance of your client outside of enable! and passing it in? I'm not too familiar with Sidekiq, but looking at the code, you may be creating multiple clients there based on thread configuration?

With that patch, just to confirm, you went from expired credentials error to no credentials are configured, correct?

@Roy-Gal-Git
Author

Roy-Gal-Git commented Feb 13, 2025

Can you try creating an instance of your client outside of enable! and pass it in? I'm not too familiar with Sidekiq, but looking at the code, you may be creating multiple clients there based on thread configuration?

I don’t think it matters whether the client is created inside or outside enable!, since it's still part of the initializer and gets passed as an argument. The Aws::CloudWatch::Client object is created once per Sidekiq process, and since Sidekiq runs multiple worker processes (not threads), each process has its own isolated instance of the client in memory.

Even if Sidekiq runs multiple threads within a process, the AWS SDK client is thread-safe, so it should not cause any issues.

With that patch, just to confirm, you went from expired credentials error to no credentials are configured, correct?

No, the "no credentials are configured" error is another problem we have that I thought might stem from the same root cause.
It happens in aws-sdk-s3 and aws-sigv4 contexts (I attached the backtraces not long ago).

@alextwoods
Contributor

It's good to know that the patch is working and seems to have eliminated the issue, but it's also a bit confusing.

The only possible reason I could see for the patch to fix the issue is if the credentials have an expiration time in the future, so they don't trigger the refresh logic, but the service considers them expired/invalid.

I think there are two things you can do to help us understand what might be going on:

  1. Add logging with before_refresh as mentioned above. I think this will at least help track when refresh is getting called on your credentials.
  2. In your patch, log the credential's expiration before forcing a refresh, something like the line below (a fuller sketch follows after this list):
     logger.warn("#{@client.class} security token expired (expiration: #{@client.config.credentials.expiration}). Refreshing client and retrying... (attempt #{retry_count})")
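
For reference, here's a hypothetical shape for that refresh-and-retry logic with the expiration logging in place (the real patch lives in your forked sidekiq-cloudwatchmetrics gem and may look different):

# Hypothetical wrapper; illustrative only, not the actual gem code.
MAX_RETRIES = 1

def publish_with_refresh(client, metric_data, namespace:, logger:)
  retry_count = 0
  begin
    client.put_metric_data(namespace: namespace, metric_data: metric_data)
  rescue Aws::CloudWatch::Errors::ExpiredToken
    raise if retry_count >= MAX_RETRIES

    retry_count += 1
    logger.warn("#{client.class} security token expired " \
                "(expiration: #{client.config.credentials.expiration}). " \
                "Refreshing credentials and retrying... (attempt #{retry_count})")
    client.config.credentials.refresh!
    retry
  end
end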
