Otel collector not scraping metrics properly #37281

Closed
gaur-piyush opened this issue Jan 17, 2025 · 15 comments
Labels: exporter/googlecloud, question (Further information is requested), receiver/prometheus (Prometheus receiver)

Comments

@gaur-piyush commented Jan 17, 2025

Component(s)

exporter/googlecloud

Describe the issue you're reporting

Hi Team,

We're using the Google Cloud exporter in our Otel configuration. However, we have seen that the Otel collector doesn't scrape all of the metrics. Can you help us identify the problem? I couldn't find an issue on the configuration side.

Attached is the configuration currently in use.

Regards,
Piyush Gaur

mode: deployment
presets:
    logsCollection:
        enabled: false
        includeCollectorLogs: false
        storeCheckpoints: false
    hostMetrics:
        enabled: false
    kubernetesAttributes:
        enabled: false
    kubernetesEvents:
        enabled: false
    clusterMetrics:
        enabled: false
    kubeletMetrics:
        enabled: false
configMap:
    create: true
config:
    exporters:
        debug:
            verbosity: normal
        googlecloud:
            project: PROJECT_ID
            metric:
                prefix: custom.googleapis.com
            sending_queue:
                enabled: true
                queue_size: 20000
    extensions:
        health_check:
          endpoint: 0.0.0.0:13133
    processors:
        batch:
            send_batch_size: 8192
            send_batch_max_size: 10000
            timeout: 10s
        memory_limiter:
            check_interval: 5s
            limit_percentage: 80
            spike_limit_percentage: 30
        resourcedetection:
            detectors: [gcp]
            timeout: 10s
        filter/cv:
            metrics:
                include:
                    match_type: strict
                    metric_names:
                        - LIST_OF_METRICS
    receivers:
        prometheus:
            config:
                scrape_configs:
                    - job_name: test
                      metrics_path: /metrics
                      scrape_interval: 300s
                      kubernetes_sd_configs:
                      - role: endpoints
                        namespaces:
                          names:
                            - test
                      relabel_configs:
                        - source_labels: [__meta_kubernetes_endpoints_name]
                          action: keep
                          regex: cloud-volumes-infrastructure
                        - source_labels: [__meta_kubernetes_namespace]
                          action: replace
                          target_label: namespace
                        - source_labels: [__meta_kubernetes_service_name]
                          action: replace
                          target_label: service
                        - source_labels: [__meta_kubernetes_pod_node_name]
                          action: replace
                          target_label: node                  
    service:
        extensions:
            - health_check
        pipelines:
            metrics:
                receivers: [prometheus]
                processors: [filter/cv, resourcedetection]
                exporters: [googlecloud, debug]
command:
    name: otelcol-contrib
    extraArgs: []
serviceAccount:
    create: true
    annotations: {}
    name: "open-telemetry-sa"
clusterRole:
    create: true
    annotations: {}
    name: ""
    rules:
    - apiGroups:
      - "apps"
      - ""
      resources:
      - 'nodes'
      - 'nodes/proxy'
      - 'nodes/metrics'
      - 'services'
      - 'endpoints'
      - 'pods'
      - 'ingresses'
      - 'configmaps'
      verbs:
      - 'get'
      - 'list'
      - 'watch'
    - apiGroups:
      - extensions
      - networking.k8s.io
      resources:
      - ingresses/status
      - ingresses
      verbs:
      - get
      - list
      - watch
    - nonResourceURLs:
      - /metrics
      verbs:
       - 'get'
       - 'list'
       - 'watch'
    clusterRoleBinding:
        annotations: {}
        name: ""
podSecurityContext: {}
securityContext: {}
nodeSelector: {}
tolerations: []
affinity: {}
topologySpreadConstraints: []
priorityClassName: ""
extraEnvs: []
extraVolumes: []
extraVolumeMounts: []
ports:
    otlp:
        enabled: true
        containerPort: 4317
        servicePort: 4317
        hostPort: 4317
        protocol: TCP
    otlp-http:
        enabled: true
        containerPort: 4318
        servicePort: 4318
        hostPort: 4318
        protocol: TCP
    otlp-lb:
        enabled: true
        containerPort: 55681
        servicePort: 55681
        hostPort: 55681
        protocol: TCP
    metrics:
        enabled: true
        containerPort: 8888
        servicePort: 8888
        protocol: TCP
resources: {}
podAnnotations: {}
podLabels: {}
hostNetwork: false
dnsPolicy: ""
replicaCount: 3
revisionHistoryLimit: 10
annotations: {}
extraContainers: []
initContainers: []
lifecycleHooks: {}
service:
    type: ClusterIP
    annotations: {}
ingress:
    enabled: false
    additionalIngresses: []
podMonitor:
    enabled: false
    metricsEndpoints:
        - port: metrics
    extraLabels: {}
serviceMonitor:
    enabled: false
    metricsEndpoints:
        - port: metrics
    extraLabels: {}
podDisruptionBudget:
    enabled: false
autoscaling:
    enabled: false
    minReplicas: 1
    maxReplicas: 10
    behavior: {}
    targetCPUUtilizationPercentage: 80
rollout:
    rollingUpdate: {}
    strategy: RollingUpdate
prometheusRule:
    enabled: false
    groups: []
    defaultRules:
        enabled: false
    extraLabels: {}
statefulset:
    volumeClaimTemplates: []
    podManagementPolicy: "Parallel"
networkPolicy:
    enabled: false
    annotations: {}
    allowIngressFrom: []
    extraIngressRules: []
    egressRules: []

values.txt

@gaur-piyush added the needs triage label on Jan 17, 2025

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dashpole (Contributor)

When you say "Otel collector doesn't scrape all the metrics", what do you mean, specifically? Do you see errors in the logs? Are all metrics missing, or just some metrics (which ones?)

@dashpole self-assigned this on Jan 17, 2025
@dashpole removed the needs triage label on Jan 17, 2025
@dashpole (Contributor)

Also note that we generally recommend using the googlemanagedprometheus exporter for metrics over the googlecloud exporter. It has lower pricing and better support for PromQL.
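
For reference, a minimal sketch of what that swap could look like against the configuration above (assuming the same PROJECT_ID placeholder and the same metrics pipeline; other exporter options are omitted):

exporters:
    googlemanagedprometheus:
        # Same GCP project placeholder as the googlecloud exporter above.
        project: PROJECT_ID
service:
    pipelines:
        metrics:
            receivers: [prometheus]
            processors: [filter/cv, resourcedetection]
            exporters: [googlemanagedprometheus, debug]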

@gaur-piyush (Author)

@dashpole It is not scraping some of the metrics. Even for the missing metrics, we can see that they were scraped for a while and then stopped.

@dashpole (Contributor)

Is the prometheus receiver failing to scrape the metric, or is the googlecloud exporter failing to export it? Do you see any errors in the logs?

@gaur-piyush (Author)

@dashpole It is scraping metrics, but a few of the metrics are getting dropped. I checked and found that it might be due to the relabel config in our scrape config. I will update this issue; please keep it open for a while.
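
For context, a keep relabel rule drops every discovered target whose source labels do not match the regex, so the relabel_configs above only scrape endpoints named cloud-volumes-infrastructure in the test namespace. A minimal sketch of widening that match (the second endpoint name is a hypothetical placeholder):

relabel_configs:
    - source_labels: [__meta_kubernetes_endpoints_name]
      action: keep
      # Targets whose endpoints name does not match this regex are dropped before scraping.
      # another-endpoints-name stands in for a second endpoint you want to keep.
      regex: cloud-volumes-infrastructure|another-endpoints-name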

@gaur-piyush (Author)

@dashpole We keep seeing this error in our Otel pod logs although there is no issue with the metrics now.

2025-01-23T11:52:10.317Z warn internal/transaction.go:111 Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1737633130315, "target_labels": "{__name__=\"up\", instance=\"100.73.178.42:15090\", job=\"\", namespace=\"\", node=\"\", service=\"\"}"}

I am unable to understand why we keep getting this.

@dashpole (Contributor)

That means the prometheus receiver is unable to scrape one of the targets. You will need to enable debug logging in the collector to see the detailed error message.

@dashpole added the receiver/prometheus and question labels on Jan 23, 2025

Pinging code owners for receiver/prometheus: @Aneurysm9 @dashpole. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.

@gaur-piyush (Author)

@dashpole I have enabled detailed logging, but I am not seeing anything else for the error mentioned above. Am I missing something here?

@dashpole (Contributor)

You did this, right?

service:
  telemetry:
    metrics:
      level: detailed

@gaur-piyush (Author)

You did this, right?

service:
  telemetry:
    metrics:
      level: detailed

Apologies, I did it for Otel, not here. Let me make the changes and see what's going on.

@gaur-piyush (Author) commented Jan 23, 2025

I have enabled the detailed level for metrics telemetry, but I am not seeing any detailed information for the above error.

service:
  telemetry:
    metrics:
      level: detailed

@dashpole (Contributor)

Oh, my bad. I copy-pasted the wrong thing from https://opentelemetry.io/docs/collector/internal-telemetry/#configure-internal-logs. It should be:

service:
  telemetry:
    logs:
      level: debug

@gaur-piyush (Author)

@dashpole We're good now. With debug logging enabled, we were able to identify the root cause of the error. Thanks a ton.
