Johannes Tuchscherer edited this page Aug 15, 2018 · 49 revisions

Frequently Asked Questions

Purpose

To enable operators to quickly diagnose Loggregator-related issues.

This FAQ consolidates helpful troubleshooting steps that answer common questions about Loggregator.

Questions

Q: How can I debug my Loggregator components?

Loggregator is a complex subcomponent of Cloud Foundry with many components of its own. We'll try to describe how you can troubleshoot Loggregator if you are having problems seeing your logs.

Rough thoughts/ideas for further expansion. Topics to expand:

  • Manual Smoke Test
    • cf login with [email protected] account
    • curl mylogspinner.cfapps.io
    • See logs come out from cf logs mylogspinner
  • Datadog
    • visualize metrics
    • Datadog Firehose Nozzle
    • Datadog Config OSS
  • Number of connections opened by component
    • lsof -c doppler
    • lsof -c trafficco ...
  • Pprof
    • curl http://localhost:{Component Pprof Port}/debug/pprof/
    • go tool pprof http://localhost:{Component Pprof Port}/debug/pprof/heap
    • Memory Dump, Goroutine dump, CPU profile.
    • The component pprof port is configured as 0, which makes the component listen on a random port. To determine the pprof port, run lsof -c doppler | grep LISTEN.
  • Goroutine dump
    • SIGUSR1 signal to process
  • --debug flag to the process
    • Not efficient because it requires process restart
  • Calls to CC and UAA are timing out
    • Check the access log in GoRouter to see whether the requests to CC and UAA are making it through. If you don't see them, it could be an IaaS issue. (TODO: provide an AWS example.) Solution on AWS: switch from a NAT gateway to a NAT instance.
  • etcd
    • Check if Dopplers are advertising and Metrons are listening
    • Check the health of the etcd cluster
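
The pprof port lookup from the list above can be sketched as a small filter over lsof output. This is only an illustration: listen_ports is a hypothetical helper name, and it assumes the usual lsof column layout in which the address column (for example 127.0.0.1:37123) comes immediately before the (LISTEN) state. It does not tell you which listening port is the pprof one; you still have to try each port against /debug/pprof/.

```shell
# Hypothetical helper, not part of Loggregator.
# Reads lsof output on stdin and prints the distinct TCP ports that are
# in the LISTEN state, one per line, in ascending numeric order.
listen_ports() {
  # The address column is the field just before "(LISTEN)"; the port is
  # whatever follows the last ":" in that field.
  awk '/\(LISTEN\)/ { n = split($(NF-1), parts, ":"); print parts[n] }' \
    | sort -un
}

# Example usage on a Doppler VM (-n and -P keep lsof from resolving
# host names and port names, so ports stay numeric):
#   lsof -nP -c doppler | listen_ports
```

Each port it prints is a candidate for curl http://localhost:{port}/debug/pprof/.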

Q: How can I check the health of my etcd cluster?

Metron uses etcd for service discovery to find the Doppler cluster. If Metron is unable to read from etcd, or if the Dopplers are not able to properly advertise themselves via etcd, then Metron will panic.

# Run the following curl for each etcd node
curl -vvv http://<etcd server>:4001/v2/stats/leader

Make sure that all the nodes agree on a single leader. Unfortunately, we've come across a scenario where the etcdctl tool states that the cluster is healthy even though it has multiple leaders. This can be caused by a network partition.

The fastest way to resolve this issue is by restarting each etcd node one at a time so that the cluster can achieve quorum.

Once the etcd cluster is restarted and restored, the dopplers and metrons will need to be restarted as well to ensure they are properly communicating with the etcd cluster.
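
The single-leader check can be sketched as two small filters. This is a rough illustration, not an official tool: leader_id and count_leaders are hypothetical helper names, the node addresses in the usage comment are placeholders, and the sketch assumes that the /v2/stats/leader response carries a top-level "leader" field holding the leader's ID, as in the etcd v2 stats API. Nodes that do not return such a field contribute nothing to the count.

```shell
# Hypothetical helpers, not part of etcd or Loggregator.

# Read a /v2/stats/leader JSON response on stdin and print the value of
# its "leader" field. Prints nothing if the field is absent.
leader_id() {
  sed -n 's/.*"leader"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Read leader IDs on stdin, one per line, and print how many distinct
# non-empty IDs were seen. A healthy cluster should yield 1.
count_leaders() {
  sed '/^$/d' | sort -u | wc -l | tr -d ' '
}

# Example usage (10.0.0.1 and 10.0.0.2 are placeholder etcd node IPs):
#   for node in 10.0.0.1 10.0.0.2; do
#     curl -s "http://$node:4001/v2/stats/leader" | leader_id
#   done | count_leaders
```

A result greater than 1 suggests the multiple-leader state described above.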

Q: How do I get etcd data when it is in TLS mode?

If your CF environment has etcd deployed in TLS mode, you will no longer be able to simply curl the data out. Follow these steps to get the data out for troubleshooting.

  1. bosh ssh etcd_z1/0
  2. cd /var/vcap/packages/etcd/
  3. To get the list of available keys, run:
./etcdctl \
--cert-file /var/vcap/jobs/etcd/config/certs/client.crt \
--key-file /var/vcap/jobs/etcd/config/certs/client.key \
--ca-file /var/vcap/jobs/etcd/config/certs/server-ca.crt \
-C https://etcd-z1-0.cf-etcd.service.cf.internal:4001 \
ls doppler/meta --recursive

You should see output similar to the following:

/doppler/meta/z1
/doppler/meta/z1/doppler_z1
/doppler/meta/z1/doppler_z1/e27e8ab6-e29c-446d-a0dd-c692c7d16dd1
/doppler/meta/z1/doppler_z1/63af35d8-d233-422f-a389-e893f4d5b7ee
/doppler/meta/z1/doppler_z1/3a45b944-24dc-4563-bbae-fc53d5bacc43
/doppler/meta/z1/doppler_z1/51737ccd-5e14-4439-8dd1-c0e3ce2aca56
  4. To get the value of a key, run:
./etcdctl \
--cert-file /var/vcap/jobs/etcd/config/certs/client.crt \
--key-file /var/vcap/jobs/etcd/config/certs/client.key \
--ca-file /var/vcap/jobs/etcd/config/certs/server-ca.crt \
-C https://etcd-z1-0.cf-etcd.service.cf.internal:4001 \
get /doppler/meta/z1/doppler_z1/e27e8ab6-e29c-446d-a0dd-c692c7d16dd1

Note: The value https://etcd-z1-0.cf-etcd.service.cf.internal:4001 can be found within the EtcdUrls property in the config files. For example, /var/vcap/jobs/doppler/config/doppler.json
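
Looking up the etcd URL from a config file can be sketched as follows. etcd_urls is a hypothetical helper name, and the sketch assumes that EtcdUrls is a JSON array of quoted URL strings; check your own config file if the layout differs.

```shell
# Hypothetical helper, not part of Loggregator.
# Print every URL listed under the "EtcdUrls" property of the JSON
# config file given as $1, one URL per line.
etcd_urls() {
  # Join the file into a single line so the array can be matched even
  # when it is pretty-printed across several lines.
  tr -d '\n' < "$1" \
    | grep -oE '"EtcdUrls"[[:space:]]*:[[:space:]]*\[[^]]*\]' \
    | grep -oE 'https?://[^"]+'
}

# Example usage:
#   etcd_urls /var/vcap/jobs/doppler/config/doppler.json
```

Any URL it prints can be passed to the -C flag of the etcdctl commands above.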

Q: How do I disable UAA for the Traffic Controller?

Traffic Controller has a property in its spec called traffic_controller.disable_access_control.

By default this is false. This is not a config property but rather a flag passed in to the traffic controller. See here.

Setting this property to true makes the logAccessAuthorizer and the adminAuthorizer always allow access to the app logs and firehose.

This feature was originally created so that Loggregator could be used in Lattice.

Q: Why do I get the "can't forward message: loggregator client pool is empty" error?

This error message shows up in the Metron logs if Metron doesn't have any registered Dopplers in its client pool.

Issue 1 - Can't find ETCD

It could be that Metron or Doppler cannot communicate with ETCD, its key-value store.

  1. Look for the error message Failed to connect to etcd in the logs.
  2. Verify you can access ETCD.
  • Verify ETCD urls in the Metron config /var/vcap/jobs/metron_agent/config/metron_agent.json.
  • Try pinging ETCD to see if Doppler has advertised itself correctly.
# Old Doppler Endpoint
curl http://<your_etcd_ip>:<port/4001>/v2/keys/healthstatus/doppler?recursive=true

# New Doppler Endpoint
curl http://<your_etcd_ip>:<port/4001>/v2/keys/doppler/meta?recursive=true

The older endpoint will contain just the Doppler IP. The newer endpoint will contain json that may look like this.

{ "version": 1, "endpoints":["udp://<doppler_ip>:<port>", "tls://<doppler_ip>:<port>"]}

Issue 2 - Mismatched ETCD keys

If you see values being populated in either of the endpoints then it means your Doppler and Metron can both see ETCD and read/write to it.

  • Look at the ETCD key that Doppler is advertising. It should have the following structure.

    # Old
    /healthstatus/doppler/<zone>/<job_name>/<index>
    
    # New
    /doppler/meta/<zone>/<job_name>/<index>
    

    Compare each of these properties to the config within Metron - they should match.

    We have come across scenarios where Doppler was in a different zone and was advertising zone1, whereas Metron was configured with the property "Zone": "zone2".

    This makes Metron look for a different key, so it cannot find the Doppler IP and protocol.
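
The zone comparison described above can be sketched in shell. The helper names (key_zone, config_zone, zones_match) are hypothetical, and the sketch assumes that the Metron config stores the zone on one line as "Zone": "<value>", as in the example property above.

```shell
# Hypothetical helpers, not part of Loggregator.

# Print the <zone> segment of a Doppler ETCD key. In both the old
# (/healthstatus/doppler/...) and new (/doppler/meta/...) layouts the
# zone is the third path segment, which is field 4 when splitting on "/".
key_zone() {
  printf '%s\n' "$1" | cut -d/ -f4
}

# Print the value of the "Zone" property from the config file in $1.
config_zone() {
  sed -n 's/.*"Zone"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

# Succeed only when the zone in the ETCD key ($1) equals the zone in
# the Metron config file ($2).
zones_match() {
  [ "$(key_zone "$1")" = "$(config_zone "$2")" ]
}

# Example usage (the key is a placeholder copied from ETCD output):
#   zones_match /doppler/meta/z1/doppler_z1/0 \
#     /var/vcap/jobs/metron_agent/config/metron_agent.json \
#     || echo "zone mismatch"
```

The same idea extends to the job name and index segments (fields 5 and 6 of the key).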

Issue 3 - ETCD is in a weird state

We came across a situation where ETCD got into a weird state and its process needed to be restarted. The tracker story is here and should be resolved.

The fix is basically to kill the etcd process so that it gets restarted: killall etcd
