
amazon-cloudwatch-agent doesn't respect SIGTERM to flush metrics #961

Open
dipeshgautamsol opened this issue Nov 15, 2023 · 8 comments
dipeshgautamsol commented Nov 15, 2023

Describe the bug

I am trying to run the amazon-cloudwatch-agent as part of a Lambda extension, so that metrics sent by Lambda executions are batched.
Basically, I have a Go program that runs this command under the hood, with the static configuration files pasted below:
amazon-cloudwatch-agent -config config.toml -otelconfig config.yaml
So far, the bootup works. However, the agent doesn't seem to flush its metrics when it receives a SIGTERM signal. It just closes all extensions and shuts down without flushing any metrics to CloudWatch. This is really problematic inside a Lambda extension: if the flush interval has not been reached before the Lambda environment shuts down, the metrics are lost completely.

Apparently, the Telegraf agent understands SIGTERM or SIGUSR1 and flushes its cache before shutting down.

This issue already goes in this direction.
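
For illustration, here is a minimal sketch of the wrapper setup described above (not my actual extension code; the binary and config paths are placeholders): a Go program that starts the agent and forwards termination signals to it.

// Minimal sketch of the wrapper: start the agent and forward termination
// signals to it. Not the actual extension code; paths are placeholders.
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	cmd := exec.Command("amazon-cloudwatch-agent",
		"-config", "config.toml",
		"-otelconfig", "config.yaml")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatalf("failed to start agent: %v", err)
	}

	// Forward SIGTERM/SIGINT (and SIGUSR1, which Telegraf reportedly
	// treats as a flush request) to the agent process.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT, syscall.SIGUSR1)
	go func() {
		for s := range sigs {
			_ = cmd.Process.Signal(s)
		}
	}()

	if err := cmd.Wait(); err != nil {
		log.Printf("agent exited: %v", err)
	}
}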

What did you expect to see?
The metrics should be flushed during shutdown

What did you see instead?
The metrics are lost

What config did you use?

[agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = ""
  interval = "60s"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = true
  precision = ""
  quiet = false
  round_interval = false

[inputs]

  [[inputs.statsd]]
    interval = "10s"
    parse_data_dog_tags = true
    service_address = ":8125"
    [inputs.statsd.tags]
      "aws:AggregationInterval" = "10s"

[outputs]

  [[outputs.cloudwatch]]

config.yaml

connectors: {}
exporters:
    awscloudwatch:
        force_flush_interval: 10s
        max_datums_per_call: 1000
        max_values_per_datum: 150
        namespace: somewhere
        region: eu-central-1
        resource_to_telemetry_conversion:
            enabled: true
extensions: {}
processors: {}
receivers:
    telegraf_statsd:
        collection_interval: 10s
        initial_delay: 1s
        timeout: 0s
service:
    extensions: []
    pipelines:
        metrics/host:
            exporters:
                - awscloudwatch
            processors: []
            receivers:
                - telegraf_statsd

dipeshgautamsol commented Nov 15, 2023

Here are the shutdown logs. I would have expected the flush to happen between the "Stopping" and "Stopped" states of the CloudWatch output plugin.

{\"caller\":\"otelcol/collector.go:250\",\"msg\":\"Received signal from OS\",\"signal\":\"terminated\"}\n2023-11-15T17:09:17Z I! {\"caller\":\"service/service.go:170\",\"msg\":\"Starting shutdown...\"}\n","time":"2023-11-15T17:09:17Z"}
Profiler dump:\n[no stats is available...]\n","time":"2023-11-15T17:09:17Z"}
{\"caller\":\"adapter/receiver.go:76\",\"msg\":\"Shutdown adapter\",\"kind\":\"receiver\",\"name\":\"telegraf_statsd\",\"data_type\":\"metrics\",\"receiver\":\"statsd\"}\n2023-11-15T17:09:17Z D! Stopping the statsd service\n","time":"2023-11-15T17:09:17Z"}
Stopped the statsd service\n","time":"2023-11-15T17:09:17Z"}
Profiler is stopped during shutdown\n","time":"2023-11-15T17:09:17Z"}
Stopping the CloudWatch output plugin\n","time":"2023-11-15T17:09:17Z"}
Stopped the CloudWatch output plugin\n","time":"2023-11-15T17:09:17Z"}
{\"caller\":\"extensions/extensions.go:45\",\"msg\":\"Stopping extensions...\"}\n2023-11-15T17:09:17Z I! {\"caller\":\"service/service.go:184\",\"msg\":\"Shutdown complete.\"}\n","time":"2023-11-15T17:09:17Z"}

okankoAMZ (Contributor) commented

Hello,

Is it possible for you to provide us with the CloudWatch agent logs from /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log? This will give us more insight into the issue. After that, I'll investigate the shutdown mechanism to find the root cause.

dipeshgautamsol (Author) commented

The logs posted above should be the same logs as written in that logfile. Since I'm running the agent within my process, the agent's logs are part of my process's output.

dipeshgautamsol commented Nov 16, 2023

Here are the steps to reproduce it locally:

  1. Build the CloudWatch agent binaries locally.
  2. Using the configuration files above, run this command: amazon-cloudwatch-agent -config config.toml -otelconfig amazon-cloudwatch-agent.yaml. The flush interval is set to 10 seconds.
  3. Emit some metrics. Here is a Python script that does so (a Go alternative is sketched after this list):
# pip3 install datadog
import time

from datadog import initialize, statsd


def lambda_handler(event, context):
    options = {
        "statsd_host": "127.0.0.1",
        "statsd_port": 8125,
        "disable_telemetry": True,
        "statsd_constant_tags": [
            "Owner:someone"
        ],
    }

    initialize(**options)
    # Send 100 counter increments, roughly 20 ms apart.
    for _ in range(100):
        time.sleep(0.02)
        statsd.increment("early_shutdown", tags=["SubProcess:write-timestream-metric"])
  4. Run the metrics emitter. Leave the agent running for 10 seconds: the metrics are exported.
  5. Run the metrics emitter again. Send a SIGTERM to the agent before the 10 seconds are up: the metrics are lost.
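
For reference, here is a rough Go equivalent of the emitter above, sending raw dogstatsd-format UDP datagrams instead of using the datadog package (same metric name, tags, and port as the Python script; just a sketch):

// Rough Go equivalent of the Python emitter above: sends 100 counter
// increments as raw dogstatsd-format UDP datagrams to the statsd input.
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// dogstatsd counter format: <name>:<value>|c|#<tag1>,<tag2>
	payload := []byte("early_shutdown:1|c|#Owner:someone,SubProcess:write-timestream-metric")
	for i := 0; i < 100; i++ {
		time.Sleep(20 * time.Millisecond)
		if _, err := conn.Write(payload); err != nil {
			log.Printf("write failed: %v", err)
		}
	}
}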

Shutdown sequence

2023-11-16T09:35:48Z D! [logagent] open file count, 0
^C2023-11-16T09:35:49Z D! Profiler dump:
[no stats is available...]
2023-11-16T09:35:49Z I! Profiler is stopped during shutdown
2023-11-16T10:35:49.123+0100	info	otelcol/collector.go:250	Received signal from OS	{"signal": "interrupt"}
2023-11-16T10:35:49.123+0100	info	service/service.go:170	Starting shutdown...
2023-11-16T09:35:49Z D! Stopping the statsd service
2023-11-16T09:35:49Z D! Stopped the statsd service
2023-11-16T09:35:49Z D! Stopping the CloudWatch output plugin
2023-11-16T09:35:49Z D! Stopped the CloudWatch output plugin
2023-11-16T09:35:49Z D! [outputs.cloudwatch] LogThrottleRetryer watch throttle events goroutine exiting
2023-11-16T10:35:49.156+0100	info	extensions/extensions.go:45	Stopping extensions...
2023-11-16T10:35:49.156+0100	info	service/service.go:184	Shutdown complete.

Also note that the behavior is the same if I set the Telegraf intervals to 1s (keeping the OTel flush interval at 10 seconds).

Sometimes:
If we wait a bit, there are occasionally hints of metrics being flushed, but it takes more than a couple of seconds, and this is not consistent. I am not sure whether the delay is in the OTel collector or in the statsd side.

2023-11-16T09:49:47Z D! Stopping the CloudWatch output plugin
2023-11-16T09:49:47Z D! CloudWatch Close, 0th time to sleep since there is still some metric data remaining to publish.
2023-11-16T09:49:48Z D! CloudWatch Close, 1th time to sleep since there is still some metric data remaining to publish.
2023-11-16T09:49:49Z D! CloudWatch Close, 2th time to sleep since there is still some metric data remaining to publish.
2023-11-16T09:49:50Z D! CloudWatch Close, 3th time to sleep since there is still some metric data remaining to publish.
2023-11-16T09:49:51Z D! Stopped the CloudWatch output plugin

Also, we can't afford 3 seconds of flush time, because a Lambda extension is shut down by AWS Lambda after a maximum of 2 seconds.
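
To make that constraint concrete, here is a hypothetical helper (not part of the agent or my actual code) showing how the wrapper has to bound the wait for the agent to exit; the budget value is just an example:

// Hypothetical helper: signal the agent and wait at most `budget` for it to
// exit, so the extension stays inside Lambda's ~2 s shutdown window.
package wrapper

import (
	"os/exec"
	"syscall"
	"time"
)

func stopAgent(cmd *exec.Cmd, budget time.Duration) error {
	if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		// Agent exited in time, ideally after flushing its buffer.
		return err
	case <-time.After(budget):
		// Out of time before Lambda tears the environment down; anything
		// still buffered in the agent is lost, which is what we observe.
		return cmd.Process.Kill()
	}
}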

okankoAMZ (Contributor) commented

Hello, I am working on recreating the issue and wanted to double-check a few points:

  1. What OS were you using to build the agent?
  2. What Go version did you use to build the agent?
  3. Have you tried using an already-built agent, and does the issue still exist there?

dipeshgautamsol commented Nov 17, 2023

FROM golang:1.21-alpine3.18 as builder
RUN git clone https://github.com/aws/amazon-cloudwatch-agent.git
RUN cd amazon-cloudwatch-agent && CGO_ENABLED=0 go build -trimpath -buildmode=default\
    -ldflags="-X github.com/aws/amazon-cloudwatch-agent/cfg/agentinfo.VersionStr=2.6.0"\
    -o /opt/extensions/bin/amazon-cloudwatch-agent github.com/aws/amazon-cloudwatch-agent/cmd/amazon-cloudwatch-agent

I built the agent on darwin/arm64 and on linux/x86_64 as well, both with the same result.

dipeshgautamsol (Author) commented

Hello @okankoAMZ, were you able to reproduce the issue?

okankoAMZ (Contributor) commented

Hi!
I have not been able to reproduce it yet; however, I have created a backlog item for us to investigate this issue.

okankoAMZ added the status/backlog label on Feb 7, 2024