Agent crash when OTLP messages are received #1501

Open
tyron opened this issue Jan 14, 2025 · 1 comment

Comments

@tyron

tyron commented Jan 14, 2025

Describe the bug

Agents that are configured with OpenTelemetry Collector are crashing when receiving metrics.
I was affected by #1435, so I already updated the agents to version 1.300051.0. Still, I noticed that right before the error (below) there is a retry attempt against IMDS, so I wonder whether the two are still related.

Steps to reproduce

  • Launch an EC2 instance (IMDSv2 is enforced; not sure if that matters).
  • Configure the agent with OTLP and configure a client to send metrics to this endpoint (in my case, Terraform Enterprise Agents).

What did you expect to see?
The CloudWatch agent should be able to collect the metrics.

What did you see instead?
journalctl -u amazon-cloudwatch-agent -f outputs:

Jan 14 20:27:33 ip-10-14-21-116 systemd[1]: Started Amazon CloudWatch Agent.
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: D! [EC2] Found active network interface
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: I! imds retry client will retry 1 timesI! Detected the instance is EC2
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: 2025/01/14 20:27:33 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json ...
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json does not exist or cannot read. Skipping it.
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: 2025/01/14 20:27:33 Reading json config file path: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_deploy-awslogsunified.json ...
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: 2025/01/14 20:27:33 I! Valid Json input schema.
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: I! Detecting run_as_user...
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: I! Trying to detect region from ec2
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: timestamp_format set file_path : /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log is the same as agent log file /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log thus do not use timestamp_layout
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: timestamp_format set file_path : /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log is the same as agent log file /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log thus do not use timestamp_regex
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: 2025/01/14 20:27:33 D! ec2tagger processor required because append_dimensions is set
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: 2025/01/14 20:27:33 D! delta processor required because metrics with diskio or net are set
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: 2025/01/14 20:27:33 D! ec2tagger processor required because append_dimensions is set
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: 2025/01/14 20:27:33 D! delta processor required because metrics with diskio or net are set
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: 2025/01/14 20:27:33 D! ec2tagger processor required because append_dimensions is set
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24919]: 2025/01/14 20:27:33 Configuration validation first phase succeeded
Jan 14 20:27:33 ip-10-14-21-116 start-amazon-cloudwatch-agent[24915]: I! Detecting run_as_user...
Jan 14 20:28:03 ip-10-14-21-116 start-amazon-cloudwatch-agent[24915]: I! imds retry client will retry 1 timesI! imds retry client will retry 1 timesI! imds retry client will retry 1 timesI! imds retry client will retry 1 timespanic: runtime error: invalid memory address or nil pointer dereference
Jan 14 20:28:03 ip-10-14-21-116 start-amazon-cloudwatch-agent[24915]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x310af22]
Jan 14 20:28:03 ip-10-14-21-116 start-amazon-cloudwatch-agent[24915]: goroutine 145 [running]:
Jan 14 20:28:03 ip-10-14-21-116 start-amazon-cloudwatch-agent[24915]: github.com/aws/amazon-cloudwatch-agent/plugins/outputs/cloudwatch.(*CloudWatch).BuildMetricDatum(0xc0011b4b40, 0xc001562fa0)
Jan 14 20:28:03 ip-10-14-21-116 start-amazon-cloudwatch-agent[24915]:         github.com/aws/amazon-cloudwatch-agent/plugins/outputs/cloudwatch/cloudwatch.go:431 +0x422
Jan 14 20:28:03 ip-10-14-21-116 start-amazon-cloudwatch-agent[24915]: github.com/aws/amazon-cloudwatch-agent/plugins/outputs/cloudwatch.(*CloudWatch).pushMetricDatum(0xc0011b4b40)
Jan 14 20:28:03 ip-10-14-21-116 start-amazon-cloudwatch-agent[24915]:         github.com/aws/amazon-cloudwatch-agent/plugins/outputs/cloudwatch/cloudwatch.go:175 +0x21a
Jan 14 20:28:03 ip-10-14-21-116 start-amazon-cloudwatch-agent[24915]: created by github.com/aws/amazon-cloudwatch-agent/plugins/outputs/cloudwatch.(*CloudWatch).startRoutines in goroutine 1
Jan 14 20:28:03 ip-10-14-21-116 start-amazon-cloudwatch-agent[24915]:         github.com/aws/amazon-cloudwatch-agent/plugins/outputs/cloudwatch/cloudwatch.go:131 +0x318
Jan 14 20:28:03 ip-10-14-21-116 systemd[1]: amazon-cloudwatch-agent.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 14 20:28:03 ip-10-14-21-116 systemd[1]: amazon-cloudwatch-agent.service: Failed with result 'exit-code'.
Jan 14 20:29:03 ip-10-14-21-116 systemd[1]: amazon-cloudwatch-agent.service: Service RestartSec=1min expired, scheduling restart.
Jan 14 20:29:03 ip-10-14-21-116 systemd[1]: amazon-cloudwatch-agent.service: Scheduled restart job, restart counter is at 10.
Jan 14 20:29:03 ip-10-14-21-116 systemd[1]: Stopped Amazon CloudWatch Agent.
Jan 14 20:29:03 ip-10-14-21-116 systemd[1]: Started Amazon CloudWatch Agent.

The client fails with:

Jan 14 20:28:58 ip-10-14-21-116 tfc_agent_1[22937]: 2025-01-14T20:28:58.106Z [ERROR] core: Telemetry error: error="max retry time elapsed: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:4317: connect: connection refused\""

What version did you use?
Version: v1.300051.0

What config did you use?
Config:

{
  "agent": {
    "logfile": "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log",
    "debug": false,
    "metrics_collection_interval": 60
  },
  "metrics": {
    "aggregation_dimensions": [
      [
        "InstanceId"
      ]
    ],
    "append_dimensions": {
      "AutoScalingGroupName": "$${aws:AutoScalingGroupName}",
      "ImageId": "$${aws:ImageId}",
      "InstanceId": "$${aws:InstanceId}",
      "InstanceType": "$${aws:InstanceType}"
    },
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait",
          "cpu_usage_user",
          "cpu_usage_system"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ],
        "totalcpu": false
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      },
      "diskio": {
        "measurement": [
          "io_time"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "metrics_collection_interval": 60
      },
      "swap": {
        "measurement": [
          "swap_used_percent"
        ],
        "metrics_collection_interval": 60
      },
      "otlp": {
        "grpc_endpoint": "127.0.0.1:4317"
      }
    }
  }
}

Environment
OS: RHEL 8

@chadpatel
Contributor

This is the line where it panics:

if !distribution.IsSupportedValue(*metric.Value, distribution.MinValue, distribution.MaxValue) {

Presumably metric is nil? Not sure yet.
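
Since the line dereferences `*metric.Value`, the panic could come from either `metric` or `metric.Value` being nil. The following is a minimal sketch of a defensive guard, not the agent's actual code or fix: the `datum` type, the bounds, `isSupportedValue`, and `usableValue` are all hypothetical stand-ins for the real datum type and the `distribution` package, whose exact behavior is assumed here.

```go
package main

import "fmt"

// datum is a hypothetical stand-in for the metric datum type; the point is
// that Value is a pointer and can therefore be nil.
type datum struct {
	Name  string
	Value *float64
}

// Hypothetical bounds standing in for distribution.MinValue and MaxValue.
const (
	minValue = -1e300
	maxValue = 1e300
)

// isSupportedValue is an assumed approximation of the range check performed
// by distribution.IsSupportedValue. NaN fails both comparisons.
func isSupportedValue(v, min, max float64) bool {
	return v >= min && v <= max
}

// usableValue returns the value and true only when the datum and its Value
// are non-nil and the value is within the supported range.
func usableValue(d *datum) (float64, bool) {
	if d == nil || d.Value == nil {
		return 0, false
	}
	if !isSupportedValue(*d.Value, minValue, maxValue) {
		return 0, false
	}
	return *d.Value, true
}

func main() {
	v := 42.0
	data := []*datum{{Name: "gauge", Value: &v}, {Name: "no-value"}, nil}
	for _, d := range data {
		if val, ok := usableValue(d); ok {
			fmt.Printf("%s: %v\n", d.Name, val)
		} else {
			fmt.Println("skipped a datum with no usable value")
		}
	}
}
```

Checking both pointers before the dereference turns the crash into a skipped datum. Whether skipping is the right behavior for OTLP metrics that arrive without a value is a separate question for the maintainers.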
