Docker images tagged 0.99.0 onward don't fully support the usage of a web proxy in an air-gapped environment #11601

whysi commented Nov 5, 2024


We have an OTEL collector running in a Docker Swarm cluster (image: otel/opentelemetry-collector-contrib:0.112.0) on an internal network that does not allow direct connectivity to the public internet (an air-gapped environment).
The internal DNS used by the Docker hosts does not resolve external names: for security reasons, public name resolution is delegated to the company web proxy, and all outgoing HTTP/HTTPS connections must go through that device to reach the internet.
In our OTEL collector configuration we have an exporter that points to a public SaaS provider (Coralogix - eu2.coralogix.com), authenticated via bearer token.
As explained in the documentation, we set the HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables to instruct the collector to use the company web proxy, but this does not work as expected in versions after 0.98.0.
By starting the collector with debug logging enabled we can see that the data arrives correctly at the collector, but the export fails.
The collector does not delegate public name resolution to the web proxy, and even if we let the container resolve the name itself, the exporter still fails.
After various troubleshooting attempts we found that older versions do support proxying and the connection to the SaaS endpoint (the latest version that still works is otel/opentelemetry-collector-contrib:0.98.0).
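
For reference, this is roughly how a Go program (the collector and its gRPC client library are written in Go) is expected to pick a proxy from these variables. The snippet below is only an illustration of the standard-library behaviour, not collector code; the URL is just our exporter target used as an example.

package main

import (
    "fmt"
    "net/http"
)

func main() {
    // Build a dummy request for the exporter endpoint and ask the Go
    // standard library which proxy it would use, based on the
    // HTTP_PROXY/HTTPS_PROXY/NO_PROXY environment variables.
    req, err := http.NewRequest("GET", "https://ingress.eu2.coralogix.com:443", nil)
    if err != nil {
        panic(err)
    }
    proxyURL, err := http.ProxyFromEnvironment(req)
    if err != nil {
        panic(err)
    }
    // nil means "no proxy: connect directly".
    fmt.Println("proxy selected:", proxyURL)
}

Running this with the same environment variables as the container prints the proxy URL we expect the collector to use.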

Steps to reproduce

  1. Deploy the collector on an internal network that does not allow outgoing traffic to backends on the public internet.
  2. Have an internal web proxy that can connect to the internet.
  3. Use a Docker image of the OTEL collector from tag 0.99.0 onwards.
  4. Configure the proxy via the HTTP_PROXY, HTTPS_PROXY (and optionally NO_PROXY) environment variables, pointing at the web proxy from step 2, when the container starts.
  5. Export the data to an external service on the public network (a standalone test client is sketched right after this list).
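
To isolate the problem from the collector itself, a minimal standalone gRPC client along these lines can be run in the same network with the same proxy variables set. This is only a sketch, assuming a recent grpc-go version; the endpoint is our Coralogix ingress host.

package main

import (
    "context"
    "crypto/tls"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/connectivity"
    "google.golang.org/grpc/credentials"
)

func main() {
    // Create a client for the exporter endpoint with TLS credentials.
    // grpc-go honours HTTPS_PROXY/NO_PROXY from the environment by default.
    conn, err := grpc.NewClient("ingress.eu2.coralogix.com:443",
        grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // Leave idle mode and wait until the channel is READY (or times out),
    // mirroring the "Channel Connectivity change to ..." lines in the logs.
    conn.Connect()
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    for state := conn.GetState(); state != connectivity.Ready; state = conn.GetState() {
        if !conn.WaitForStateChange(ctx, state) {
            log.Fatalf("channel never became READY, last state: %v", state)
        }
    }
    log.Println("READY: proxy CONNECT and TLS handshake succeeded")
}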

What did you see instead?

We tested different situations; these are the outcomes:

Version 0.99.0 or newer, web proxy configured via environment variables, and the container NOT able to resolve the exporter endpoint (DNS does not resolve public names): "name resolver error: produced zero addresses"

pickfirst/pickfirst.go:122	[pick-first-lb] [pick-first-lb 0xc003a92ab0] Received error from the name resolver: produced zero addresses	{"grpc_log": true}
[email protected]/clientconn.go:544	[core] [Channel #2]Channel Connectivity change to TRANSIENT_FAILURE	{"grpc_log": true}
grpcsync/callback_serializer.go:94	[core] error from balancer.UpdateClientConnState: bad resolver state	{"grpc_log": true}
internal/retry_sender.go:126	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "traces", "name": "*****", "error": "rpc error: code = Unavailable desc = name resolver error: produced zero addresses", "interval": "5.121492409s"}

Version 0.99.0 or newer, web proxy configured via environment variables, and the container ABLE to resolve the exporter endpoint: "authentication handshake failed: EOF"

[email protected]/resolver_wrapper.go:200	[core] [Channel #1]Resolver state updated: {
  "Addresses": [
    {
      "Addr": "16.170.111.131:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "16.170.111.131:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} ()	{"grpc_log": true}

...
...
...

[email protected]/clientconn.go:544	[core] [Channel #1]Channel Connectivity change to CONNECTING	{"grpc_log": true}
[email protected]/clientconn.go:1199	[core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING	{"grpc_log": true}
[email protected]/clientconn.go:1317	[core] [Channel #1 SubChannel #2]Subchannel picks a new address "16.170.111.131:443" to connect	{"grpc_log": true}
pickfirst/pickfirst.go:176	[pick-first-lb] [pick-first-lb 0xc0030a0990] Received SubConn state update: 0xc0030a0a20, {ConnectivityState:CONNECTING ConnectionError:<nil> connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}	{"grpc_log": true}
[email protected]/clientconn.go:1319	[core] Creating new client transport to "{Addr: \"16.170.111.131:443\", ServerName: \"ingress.eu2.coralogix.com:443\", }": connection error: desc = "transport: authentication handshake failed: EOF"	{"grpc_log": true}
[email protected]/clientconn.go:1379	[core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "16.170.111.131:443", ServerName: "ingress.eu2.coralogix.com:443", }. Err: connection error: desc = "transport: authentication handshake failed: EOF"	{"grpc_log": true}
[email protected]/clientconn.go:1201	[core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: authentication handshake failed: EOF"	{"grpc_log": true}
pickfirst/pickfirst.go:176	[pick-first-lb] [pick-first-lb 0xc0030a0990] Received SubConn state update: 0xc0030a0a20, {ConnectivityState:TRANSIENT_FAILURE ConnectionError:connection error: desc = "transport: authentication handshake failed: EOF" connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}	{"grpc_log": true}
[email protected]/clientconn.go:544	[core] [Channel #1]Channel Connectivity change to TRANSIENT_FAILURE	{"grpc_log": true}
internal/retry_sender.go:126	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "metrics", "name": "coralogix", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: EOF\"", "interval": "6.314044291s"}
[email protected]/clientconn.go:1201	[core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: authentication handshake failed: EOF"	{"grpc_log": true}
pickfirst/pickfirst.go:176	[pick-first-lb] [pick-first-lb 0xc0030a0990] Received SubConn state update: 0xc0030a0a20, {ConnectivityState:IDLE ConnectionError:connection error: desc = "transport: authentication handshake failed: EOF" connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}	{"grpc_log": true}

Watching the container logs, these are the two different types of errors we found with the newer images.

Version 0.98.0 or older, web proxy configured via environment variables, and the container NOT able to resolve the exporter endpoint (DNS does not resolve public names): EVERYTHING WORKS AS EXPECTED - "Channel Connectivity change to READY"

2024-11-05T13:27:03.310Z	info	zapgrpc/zapgrpc.go:176	[core] [Channel #1]Resolver state updated: {
  "Addresses": [
    {
      "Addr": "ingress.eu2.coralogix.com:443",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "ingress.eu2.coralogix.com:443",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)	{"grpc_log": true}

info	zapgrpc/zapgrpc.go:176	[core] [Channel #1 SubChannel #2]Subchannel created	{"grpc_log": true}
info	zapgrpc/zapgrpc.go:176	[core] [Channel #1]Channel Connectivity change to CONNECTING	{"grpc_log": true}
info	zapgrpc/zapgrpc.go:176	[core] [Channel #1]Channel exiting idle mode	{"grpc_log": true}
info	zapgrpc/zapgrpc.go:176	[core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING	{"grpc_log": true}
info	zapgrpc/zapgrpc.go:176	[core] [Channel #1 SubChannel #2]Subchannel picks a new address "ingress.eu2.coralogix.com:443" to connect	{"grpc_log": true}
info	zapgrpc/zapgrpc.go:176	[core] [pick-first-lb 0xc002a1fc50] Received SubConn state update: 0xc002a1fce0, {ConnectivityState:CONNECTING ConnectionError:<nil>}	{"grpc_log": true}
info	[email protected]/metrics_receiver.go:272	Starting discovery manager	{"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
info	[email protected]/metrics_receiver.go:250	Scrape job added	{"kind": "receiver", "name": "prometheus", "data_type": "metrics", "jobName": "docker_hosts_metrics"}
debug	discovery/manager.go:286	Starting provider	{"kind": "receiver", "name": "prometheus", "data_type": "metrics", "provider": "static/0", "subs": "map[docker_hosts_metrics:{}]"}
info	[email protected]/service.go:169	Everything is ready. Begin running and processing data.
debug	discovery/manager.go:320	Discoverer channel closed	{"kind": "receiver", "name": "prometheus", "data_type": "metrics", "provider": "static/0"}
info	[email protected]/metrics_receiver.go:326	Starting scrape manager	{"kind": "receiver", "name": "prometheus", "data_type": "metrics"}
warn	localhostgate/featuregate.go:63	The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default.	{"feature gate ID": "component.UseLocalHostAsDefaultHost"}
info	zapgrpc/zapgrpc.go:176	[core] [Channel #1 SubChannel #2]Subchannel Connectivity change to READY	{"grpc_log": true}
info	zapgrpc/zapgrpc.go:176	[core] [pick-first-lb 0xc002a1fc50] Received SubConn state update: 0xc002a1fce0, {ConnectivityState:READY ConnectionError:<nil>}	{"grpc_log": true}
info	zapgrpc/zapgrpc.go:176	[core] [Channel #1]Channel Connectivity change to READY	{"grpc_log": true}

What version did you use?

Tags from 0.99.0 onwards (included) generate the errors above.
Tags up to and including 0.98.0 work as expected.

What config did you use?

Docker Compose YAML content:

version: "3.9"
services:
  otel:
    image: otel/opentelemetry-collector-contrib:0.99.0
    command: ["--config=/etc/otelcol-contrib/config.yaml"]
    configs:
      - source: config.yaml
        target: /etc/otelcol-contrib/config.yaml
        mode: 0744         
    networks:
      - private
    environment:      
      HTTP_PROXY: "MY_HTTP_PROXY:PORT"
      HTTPS_PROXY: "MY_HTTPS_PROXY:PORT"      
      NO_PROXY: "my.domain,localhost,127.0.0.1"            
    ports:
      - 4317:4317 
      - 4318:4318 
    deploy:
      mode: replicated
      replicas: 1    
configs:
  config.yaml:
    external: true
networks:
  private:
    driver: overlay
    attachable: true

config.yaml content:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  otlp/http:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:              
        - job_name: cadvisor_metrics
          scrape_interval: 1m
          metrics_path: /metrics
          static_configs:
            - targets:
              - 'mydockerhost.my.domain:8080'
exporters:
  coralogix:
    domain: "eu2.coralogix.com"
    private_key: "My_PRIVATE_KEY"
    application_name: "MY_APP_NAME"
    subsystem_name: "MY_SUBSYSTEM_NAME"
    timeout: 30s
service:
  pipelines:
    metrics:
      receivers: [ prometheus ]
      exporters: [ coralogix ]

Additional context

As you can see, something changed starting with version 0.99.0. First of all, when an HTTP/HTTPS proxy is configured, an HTTP client should delegate public name resolution to the web proxy (the hostname goes to the proxy in the CONNECT request) instead of asking the local DNS server.
Older versions do this correctly; newer versions do not.
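
The "produced zero addresses" error suggests client-side DNS resolution. With gRPC this depends on the target name-resolution scheme, so my guess (I have not confirmed it in the collector code) is that the default resolver behaviour changed around 0.99.0. A minimal sketch of the difference, not collector code:

package main

import (
    "crypto/tls"
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials"
)

func main() {
    creds := grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{}))

    // "dns" scheme: the client resolves the hostname itself before dialing.
    // In our network this local lookup returns nothing, which matches the
    // "name resolver error: produced zero addresses" seen above.
    connDNS, err := grpc.NewClient("dns:///ingress.eu2.coralogix.com:443", creds)
    if err != nil {
        log.Fatal(err)
    }
    defer connDNS.Close()

    // "passthrough" scheme: the unresolved hostname is handed to the dialer,
    // so the HTTPS proxy receives "CONNECT ingress.eu2.coralogix.com:443"
    // and performs the public name resolution itself.
    connPT, err := grpc.NewClient("passthrough:///ingress.eu2.coralogix.com:443", creds)
    if err != nil {
        log.Fatal(err)
    }
    defer connPT.Close()
}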

Even when we use a newer version and make the container able to resolve the public name of the Coralogix endpoint, the logs show a public IP address in the "Addr" field instead of the DNS name used by the older versions of the collector.
I think we get the "authentication handshake failed: EOF" error because the HTTP client validates the TLS certificate presented by the public server against the IP address it is connecting to, which does not match; this would not happen if the connection used the real endpoint name, which is certainly included in the certificate's subject alternative names.
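
A quick, hypothetical way to test this outside the collector, from a host that can reach the endpoint directly (so not from inside the air-gapped network), using the IP address from the logs above:

package main

import (
    "crypto/tls"
    "fmt"
)

func main() {
    // Dial the resolved IP from the logs and compare TLS verification when
    // the expected hostname is supplied versus when only the IP is used.
    for _, cfg := range []*tls.Config{
        {ServerName: "ingress.eu2.coralogix.com"}, // verify against the real name
        {},                                        // verify against the IP only
    } {
        conn, err := tls.Dial("tcp", "16.170.111.131:443", cfg)
        fmt.Printf("ServerName=%q -> err=%v\n", cfg.ServerName, err)
        if err == nil {
            conn.Close()
        }
    }
}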

Searching the existing issues I found this: #10814 (comment)
It is not the same issue, although it is also related to the TLS connection of the exporter, and there it was solved by falling back to an older version.

@FabioSirugo

Is there any news for this issue?
