
amazon-cloudwatch-agent doesn't respect SIGTERM to flush metrics #961

Open
dipeshgautamsol opened this issue Nov 15, 2023 · 8 comments
dipeshgautamsol commented Nov 15, 2023

Describe the bug

I am trying to run the amazon-cloudwatch-agent as part of a Lambda extension, so that metrics sent by Lambda executions are batched.
Basically, I have a Go program that runs this command under the hood, with the static configuration files pasted below:
amazon-cloudwatch-agent -config config.toml -otelconfig config.yaml
So far, the bootup works. However, the agent doesn't seem to flush its metrics when it receives a SIGTERM signal. It just closes all extensions and shuts down without flushing any metrics to CloudWatch. This is really problematic inside a Lambda extension: if the flush interval has not been reached before the Lambda environment shuts down, the metrics are lost completely.

Apparently, the Telegraf agent understands SIGTERM or SIGUSR1 and flushes its cache before shutting down.

This issue already goes in this direction.
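
For illustration, here is a minimal sketch of the wrapper setup described above (not my actual extension code; the binary and config paths are placeholders): a Go program that starts the agent and forwards termination signals to it.

// Minimal sketch of the wrapper: start the agent and forward termination
// signals to it. Not the actual extension code; paths are placeholders.
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	cmd := exec.Command("amazon-cloudwatch-agent",
		"-config", "config.toml",
		"-otelconfig", "config.yaml")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		log.Fatalf("failed to start agent: %v", err)
	}

	// Forward SIGTERM/SIGINT (and SIGUSR1, which Telegraf reportedly
	// treats as a flush request) to the agent process.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT, syscall.SIGUSR1)
	go func() {
		for s := range sigs {
			_ = cmd.Process.Signal(s)
		}
	}()

	if err := cmd.Wait(); err != nil {
		log.Printf("agent exited: %v", err)
	}
}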

What did you expect to see?
The metrics should be flushed during shutdown

What did you see instead?
The metrics are lost

What config did you use?

[agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = ""
  interval = "60s"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = true
  precision = ""
  quiet = false
  round_interval = false

[inputs]

  [[inputs.statsd]]
    interval = "10s"
    parse_data_dog_tags = true
    service_address = ":8125"
    [inputs.statsd.tags]
      "aws:AggregationInterval" = "10s"

[outputs]

  [[outputs.cloudwatch]]

config.yaml

connectors: {}
exporters:
    awscloudwatch:
        force_flush_interval: 10s
        max_datums_per_call: 1000
        max_values_per_datum: 150
        namespace: somewhere
        region: eu-central-1
        resource_to_telemetry_conversion:
            enabled: true
extensions: {}
processors: {}
receivers:
    telegraf_statsd:
        collection_interval: 10s
        initial_delay: 1s
        timeout: 0s
service:
    extensions: []
    pipelines:
        metrics/host:
            exporters:
                - awscloudwatch
            processors: []
            receivers:
                - telegraf_statsd

dipeshgautamsol commented Nov 15, 2023

Here are the shutdown logs. I would have expected the flush to happen between the "Stopping" and "Stopped" states of the CloudWatch output plugin.

{\"caller\":\"otelcol/collector.go:250\",\"msg\":\"Received signal from OS\",\"signal\":\"terminated\"}\n2023-11-15T17:09:17Z I! {\"caller\":\"service/service.go:170\",\"msg\":\"Starting shutdown...\"}\n","time":"2023-11-15T17:09:17Z"}
Profiler dump:\n[no stats is available...]\n","time":"2023-11-15T17:09:17Z"}
{\"caller\":\"adapter/receiver.go:76\",\"msg\":\"Shutdown adapter\",\"kind\":\"receiver\",\"name\":\"telegraf_statsd\",\"data_type\":\"metrics\",\"receiver\":\"statsd\"}\n2023-11-15T17:09:17Z D! Stopping the statsd service\n","time":"2023-11-15T17:09:17Z"}
Stopped the statsd service\n","time":"2023-11-15T17:09:17Z"}
Profiler is stopped during shutdown\n","time":"2023-11-15T17:09:17Z"}
Stopping the CloudWatch output plugin\n","time":"2023-11-15T17:09:17Z"}
Stopped the CloudWatch output plugin\n","time":"2023-11-15T17:09:17Z"}
{\"caller\":\"extensions/extensions.go:45\",\"msg\":\"Stopping extensions...\"}\n2023-11-15T17:09:17Z I! {\"caller\":\"service/service.go:184\",\"msg\":\"Shutdown complete.\"}\n","time":"2023-11-15T17:09:17Z"}

okankoAMZ (Contributor) commented

Hello,

Is it possible for you to provide us with the CloudWatch agent logs from /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log? This will give us more insight into the issue. After that, I'll investigate the shutdown mechanism to find the root cause.

dipeshgautamsol (Author) commented

The logs posted above should be the same logs as written in that logfile. Since I'm running the agent within my process, the agent's logs are part of my process's output.

dipeshgautamsol commented Nov 16, 2023

Here are the steps to reproduce it locally:

  1. Build the CloudWatch agent binaries locally.
  2. Using the configuration files above, run this command: amazon-cloudwatch-agent -config config.toml -otelconfig amazon-cloudwatch-agent.yaml. The flush interval is set to 10 seconds.
  3. Emit some metrics. Here is a Python script that does so (a Go alternative is sketched after this list):
# pip3 install datadog
import time

from datadog import initialize, statsd


def lambda_handler(event, context):
    options = {
        "statsd_host": "127.0.0.1",
        "statsd_port": 8125,
        "disable_telemetry": True,
        "statsd_constant_tags": [
            "Owner:someone"
        ],
    }

    initialize(**options)
    # Send 100 counter increments, roughly 20 ms apart.
    for _ in range(100):
        time.sleep(0.02)
        statsd.increment("early_shutdown", tags=["SubProcess:write-timestream-metric"])
  4. Run the metrics emitter. Leave the agent running for 10 seconds: the metrics are exported.
  5. Run the metrics emitter again. Send a SIGTERM to the agent before the 10 seconds are up: the metrics are lost.
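
For reference, here is a rough Go equivalent of the emitter above, sending raw dogstatsd-format UDP datagrams instead of using the datadog package (same metric name, tags, and port as the Python script; just a sketch):

// Rough Go equivalent of the Python emitter above: sends 100 counter
// increments as raw dogstatsd-format UDP datagrams to the statsd input.
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// dogstatsd counter format: <name>:<value>|c|#<tag1>,<tag2>
	payload := []byte("early_shutdown:1|c|#Owner:someone,SubProcess:write-timestream-metric")
	for i := 0; i < 100; i++ {
		time.Sleep(20 * time.Millisecond)
		if _, err := conn.Write(payload); err != nil {
			log.Printf("write failed: %v", err)
		}
	}
}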

Shutdown sequence

2023-11-16T09:35:48Z D! [logagent] open file count, 0
^C2023-11-16T09:35:49Z D! Profiler dump:
[no stats is available...]
2023-11-16T09:35:49Z I! Profiler is stopped during shutdown
2023-11-16T10:35:49.123+0100	info	otelcol/collector.go:250	Received signal from OS	{"signal": "interrupt"}
2023-11-16T10:35:49.123+0100	info	service/service.go:170	Starting shutdown...
2023-11-16T09:35:49Z D! Stopping the statsd service
2023-11-16T09:35:49Z D! Stopped the statsd service
2023-11-16T09:35:49Z D! Stopping the CloudWatch output plugin
2023-11-16T09:35:49Z D! Stopped the CloudWatch output plugin
2023-11-16T09:35:49Z D! [outputs.cloudwatch] LogThrottleRetryer watch throttle events goroutine exiting
2023-11-16T10:35:49.156+0100	info	extensions/extensions.go:45	Stopping extensions...
2023-11-16T10:35:49.156+0100	info	service/service.go:184	Shutdown complete.

Also note that the behavior is the same if I set the Telegraf intervals to 1s (keeping the OTel flush interval at 10 seconds).

Sometimes:
If we wait a bit, there are occasionally hints of metrics being flushed, but it takes more than a couple of seconds, and this is not consistent. I am not sure whether the delay is in the OTel collector or in the statsd side.

2023-11-16T09:49:47Z D! Stopping the CloudWatch output plugin
2023-11-16T09:49:47Z D! CloudWatch Close, 0th time to sleep since there is still some metric data remaining to publish.
2023-11-16T09:49:48Z D! CloudWatch Close, 1th time to sleep since there is still some metric data remaining to publish.
2023-11-16T09:49:49Z D! CloudWatch Close, 2th time to sleep since there is still some metric data remaining to publish.
2023-11-16T09:49:50Z D! CloudWatch Close, 3th time to sleep since there is still some metric data remaining to publish.
2023-11-16T09:49:51Z D! Stopped the CloudWatch output plugin

Also, we can't afford 3 seconds of flush time, because a Lambda extension is shut down by AWS Lambda after a maximum of 2 seconds.
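
To make that constraint concrete, here is a hypothetical helper (not part of the agent or my actual code) showing how the wrapper has to bound the wait for the agent to exit; the budget value is just an example:

// Hypothetical helper: signal the agent and wait at most `budget` for it to
// exit, so the extension stays inside Lambda's ~2 s shutdown window.
package wrapper

import (
	"os/exec"
	"syscall"
	"time"
)

func stopAgent(cmd *exec.Cmd, budget time.Duration) error {
	if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		// Agent exited in time, ideally after flushing its buffer.
		return err
	case <-time.After(budget):
		// Out of time before Lambda tears the environment down; anything
		// still buffered in the agent is lost, which is what we observe.
		return cmd.Process.Kill()
	}
}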

okankoAMZ (Contributor) commented

Hello, I am working on recreating the issue and wanted to double-check a few points:

  1. What OS were you using to build the agent?
  2. What Go version did you use to build the agent?
  3. Have you tried using an already-built agent, and does the issue still exist there?

dipeshgautamsol commented Nov 17, 2023

FROM golang:1.21-alpine3.18 as builder
RUN git clone https://github.com/aws/amazon-cloudwatch-agent.git
RUN cd amazon-cloudwatch-agent && CGO_ENABLED=0 go build -trimpath -buildmode=default\
    -ldflags="-X github.com/aws/amazon-cloudwatch-agent/cfg/agentinfo.VersionStr=2.6.0"\
    -o /opt/extensions/bin/amazon-cloudwatch-agent github.com/aws/amazon-cloudwatch-agent/cmd/amazon-cloudwatch-agent

I built the agent on darwin/arm64 and on linux/x86_64 as well, both with the same result.

dipeshgautamsol (Author) commented

Hello @okankoAMZ, were you able to reproduce the issue?

okankoAMZ (Contributor) commented

Hi!
I have not been able to reproduce it yet; however, I have created a backlog item for us to investigate this issue.

okankoAMZ added the status/backlog label on Feb 7, 2024