The goal of this page is to enable operators to quickly diagnose Loggregator-related issues.
This FAQ consolidates helpful troubleshooting steps that answer common questions about Loggregator.
- TODO: How do I enable syslog forwarding for a job?
- TODO: How can I debug my Loggregator components?
- How can I check the health of my etcd cluster?
- How do I get etcd data when it is in TLS mode?
- How do I disable UAA for Traffic Controller?
- What do the Doppler properties mean?
- What do the Metron properties mean?
- What do the Traffic Controller properties mean?
- Why do I get the error "can't forward message: loggregator client pool is empty"?
Loggregator is a complex subcomponent of Cloud Foundry with many components of its own. This page describes how to troubleshoot Loggregator when you are having problems seeing your logs.
Rough thoughts/ideas for further expansion. Topics to expand:
- Manual Smoke Test
- cf login with [email protected] account
- curl mylogspinner.cfapps.io
- See logs come out from cf logs mylogspinner
- Datadog
- visualize metrics
- Datadog Firehose Nozzle
- Datadog Config OSS
- Number of connections opened by component
lsof -c doppler
lsof -c trafficco
...
- Pprof
curl http://localhost:{Component Pprof Port}/debug/pprof/
go tool pprof http://localhost:{Component Pprof Port}/debug/pprof/heap
- Memory Dump, Goroutine dump, CPU profile.
- Component pprof Port is configured to 0, which will generate a random port. To determine the pprof port, run
lsof -c doppler | grep LISTEN
- Goroutine dump
- SIGUSR1 signal to process
- Pass the --debug flag to the process
- Not efficient because it requires a process restart
- Calls to CC and UAA are timing out
- Check the access log in GoRouter to see whether the requests to CC and UAA are making it through. If you don't see them, it could be an IaaS issue. Provide AWS example. Solution: switch from a NAT gateway to a NAT instance in AWS.
- etcd
- Check if Dopplers are advertising and Metrons are listening
- Check the health of the etcd cluster
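The pprof notes in the list above say the port is chosen at random and has to be found with lsof. The following is a minimal sketch of that lookup, not an official tool. The function names listen_ports and find_pprof_port are hypothetical, and the sketch assumes lsof's default column layout, where a listening socket line ends with "<address>:<port> (LISTEN)". The -nP flags ask lsof to print numeric addresses and ports.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of Loggregator.

# Read lsof output on stdin and print the unique listening ports, one per line.
# Assumes the last two columns look like "<address>:<port> (LISTEN)".
listen_ports() {
  awk '$NF == "(LISTEN)" { n = split($(NF-1), parts, ":"); print parts[n] }' | sort -un
}

# Probe every listening port of a component (for example doppler) and print
# the first one that answers on /debug/pprof/. Run this on the component's VM.
find_pprof_port() {
  local component="$1" port
  for port in $(lsof -nP -c "$component" | listen_ports); do
    if curl -s -f -o /dev/null --max-time 2 "http://localhost:${port}/debug/pprof/"; then
      echo "$port"
      return 0
    fi
  done
  return 1
}
```

For example, find_pprof_port doppler prints the port to use in the go tool pprof URL, or returns non-zero if no listening port serves pprof.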
Metron uses etcd for service discovery to find the Doppler cluster. If metron is unable to read from etcd OR if the Dopplers are not able to properly advertise themselves via etcd, then metron will panic.
# Run the following curl for each etcd node
curl -vvv http://<etcd server>:4001/v2/stats/leader
Make sure that there is only one leader for all the nodes. Unfortunately, we've come across a scenario where the etcdctl tool reports that the cluster is healthy even though it is in a state where there are multiple leaders. This can be caused by a network partition.
The fastest way to resolve this issue is by restarting each etcd node one at a time so that the cluster can achieve quorum.
Once the etcd cluster is restarted and restored, the dopplers and metrons will need to be restarted as well to ensure they are properly communicating with the etcd cluster.
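The leader check can be scripted across all nodes. The following is a minimal sketch under stated assumptions: it queries the etcd v2 endpoint /v2/stats/self (rather than /v2/stats/leader) on port 4001 over plain HTTP and assumes the response contains a leaderInfo.leader field. The function names and the ETCD_NODES variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: function names and ETCD_NODES are hypothetical.

# Extract the leader ID from one /v2/stats/self JSON body read on stdin.
# Assumes the body contains "leaderInfo": {"leader": "<id>", ...}.
leader_id() {
  grep -oE '"leader"[[:space:]]*:[[:space:]]*"[^"]*"' | head -n 1 | cut -d'"' -f4
}

# Read "<node> <leader_id>" lines on stdin and print the number of distinct leaders.
count_distinct_leaders() {
  awk 'NF >= 2 { print $2 }' | sort -u | wc -l | tr -d ' '
}

# Only touch the network when ETCD_NODES is set, for example:
#   ETCD_NODES="10.0.0.1 10.0.0.2 10.0.0.3" ./check_leaders.sh
if [ -n "${ETCD_NODES:-}" ]; then
  report=$(for node in $ETCD_NODES; do
    id=$(curl -s --max-time 5 "http://${node}:4001/v2/stats/self" | leader_id)
    # A node that does not answer is reported as "unknown" and counts as its own leader.
    echo "$node ${id:-unknown}"
  done)
  echo "$report"
  echo "distinct leaders: $(echo "$report" | count_distinct_leaders)"
fi
```

A result other than 1 suggests either multiple leaders or an unreachable node, which is the situation where restarting the etcd nodes one at a time applies.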
If your CF environment has etcd deployed in TLS mode, you will no longer be able to simply curl the data out.
The following steps retrieve the data to help with troubleshooting.
bosh ssh etcd_z1/0
cd /var/vcap/packages/etcd/
- In order to get the list of available keys,
./etcdctl \
--cert-file /var/vcap/jobs/etcd/config/certs/client.crt \
--key-file /var/vcap/jobs/etcd/config/certs/client.key \
--ca-file /var/vcap/jobs/etcd/config/certs/server-ca.crt \
-C https://etcd-z1-0.cf-etcd.service.cf.internal:4001 \
ls doppler/meta --recursive
You should see output similar to the output below
/doppler/meta/z1
/doppler/meta/z1/doppler_z1
/doppler/meta/z1/doppler_z1/e27e8ab6-e29c-446d-a0dd-c692c7d16dd1
/doppler/meta/z1/doppler_z1/63af35d8-d233-422f-a389-e893f4d5b7ee
/doppler/meta/z1/doppler_z1/3a45b944-24dc-4563-bbae-fc53d5bacc43
/doppler/meta/z1/doppler_z1/51737ccd-5e14-4439-8dd1-c0e3ce2aca56
- Get the value of a key,
./etcdctl \
--cert-file /var/vcap/jobs/etcd/config/certs/client.crt \
--key-file /var/vcap/jobs/etcd/config/certs/client.key \
--ca-file /var/vcap/jobs/etcd/config/certs/server-ca.crt \
-C https://etcd-z1-0.cf-etcd.service.cf.internal:4001 \
get /doppler/meta/z1/doppler_z1/e27e8ab6-e29c-446d-a0dd-c692c7d16dd1
Note: The value https://etcd-z1-0.cf-etcd.service.cf.internal:4001 can be found within the EtcdUrls property in the config files, for example /var/vcap/jobs/doppler/config/doppler.json.
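Because every etcdctl call repeats the same TLS flags, a small wrapper can save typing. The following is a sketch, assuming the certificate paths and endpoint shown in the commands above. The names etcd_tls and dump_doppler_meta and the ETCDCTL, CERT_DIR and ETCD_URL variables are hypothetical. It also assumes that etcdctl get exits non-zero for directory keys, so those are skipped.

```shell
#!/usr/bin/env bash
# Sketch only: wrapper names and variables are hypothetical.
# Defaults mirror the paths used in the manual steps; override them as needed.
ETCDCTL="${ETCDCTL:-./etcdctl}"
CERT_DIR="${CERT_DIR:-/var/vcap/jobs/etcd/config/certs}"
ETCD_URL="${ETCD_URL:-https://etcd-z1-0.cf-etcd.service.cf.internal:4001}"

# Run etcdctl with the TLS client flags, passing any further arguments through.
etcd_tls() {
  "$ETCDCTL" \
    --cert-file "$CERT_DIR/client.crt" \
    --key-file "$CERT_DIR/client.key" \
    --ca-file "$CERT_DIR/server-ca.crt" \
    -C "$ETCD_URL" \
    "$@"
}

# Print "<key> => <value>" for every Doppler key that holds a value.
# Keys for which get fails (assumed to be directories) are skipped.
dump_doppler_meta() {
  etcd_tls ls doppler/meta --recursive | while read -r key; do
    if value=$(etcd_tls get "$key" 2>/dev/null </dev/null); then
      echo "$key => $value"
    fi
  done
}
```

After sourcing the file from /var/vcap/packages/etcd/, running dump_doppler_meta lists each advertised Doppler together with its value.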
Traffic Controller has a property in its spec called traffic_controller.disable_access_control. By default this is false. This is not a config property but rather a flag passed in to the traffic controller. See here.
Setting this property to true makes the logAccessAuthorizer and the adminAuthorizer always allow access to the app logs and firehose.
This feature was originally created so that Loggregator could be used in Lattice.
This error message shows up in the Metron logs if it doesn't have any registered Dopplers in its client pool.
It could be that Metron or Doppler cannot communicate with its Key-Value store ETCD.
- Look for the error message Failed to connect to etcd in the logs.
- Verify you can access ETCD.
- Verify ETCD urls in the Metron config /var/vcap/jobs/metron_agent/config/metron_agent.json.
- Try querying ETCD to see if Doppler has advertised itself correctly.
# Old Doppler Endpoint
curl "http://<your_etcd_ip>:<port/4001>/v2/keys/healthstatus/doppler?recursive=true"
# New Doppler Endpoint
curl "http://<your_etcd_ip>:<port/4001>/v2/keys/doppler/meta?recursive=true"
The older endpoint will contain just the Doppler IP. The newer endpoint will contain json that may look like this.
{ "version": 1, "endpoints": ["udp://<doppler_ip>:<port>", "tls://<doppler_ip>:<port>"] }
If you see values being populated in either of the endpoints then it means your Doppler and Metron can both see ETCD and read/write to it.
- Look at the ETCD key that Doppler is advertising. It should have the following structure.
# Old
/healthstatus/doppler/<zone>/<job_name>/<index>
# New
/doppler/meta/<zone>/<job_name>/<index>
Compare each of these properties to the config within Metron - they should match.
We have come across scenarios where Doppler was in a different zone and was advertising zone1, whereas Metron was configured with the property "Zone": "zone2". This makes Metron look for a different key, so it is unable to find the Doppler IP and protocol.
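The zone comparison can be checked mechanically. The following is a minimal sketch, assuming the key layout /doppler/meta/<zone>/<job_name>/<index> (or the older /healthstatus/doppler/... form) and a "Zone" property in the Metron config, as described above. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Print the <zone> segment of a Doppler key. In both the old and the new
# layout the zone is the fourth "/"-separated field (the first one is empty).
key_zone() {
  echo "$1" | cut -d/ -f4
}

# Print the value of the "Zone" property from a Metron config read on stdin.
config_zone() {
  grep -oE '"Zone"[[:space:]]*:[[:space:]]*"[^"]*"' | head -n 1 | cut -d'"' -f4
}

# Succeed when the zone in the Doppler key equals the zone in the Metron config.
# usage: zones_match <doppler_key> <metron_config_file>
zones_match() {
  [ "$(key_zone "$1")" = "$(config_zone < "$2")" ]
}
```

For example, zones_match /doppler/meta/z1/doppler_z1/<index> /var/vcap/jobs/metron_agent/config/metron_agent.json returns non-zero when Metron is looking in a different zone than the one Doppler advertises.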
We came across a situation where ETCD got into a bad state and its process needed to be restarted. The tracker story is here and should be resolved.
The workaround was essentially to run killall etcd.