
app health monitoring #1006

Open
daveoy opened this issue Jan 9, 2025 · 3 comments

Comments

@daveoy (Contributor) commented Jan 9, 2025

Has anyone thought of adding internal app metrics to show whether the problem daemons themselves are having any issues?

Following on from #1003, I have added a few internal log events in various places inside the kmsg watcher so that I can track how often watch loops are starting and how often watchers are being revived.

Simple things like adding

	k.logCh <- &logtypes.Log{
		Message:   "[npd-internal] Entering watch loop",
		Timestamp: time.Now(),
	}

when we start the watch loop, or

	k.logCh <- &logtypes.Log{
		Message:   "[npd-internal] Reviving kmsg parser",
		Timestamp: time.Now(),
	}

whenever we revive the kmsg parser from inside the watcher. Paired with config like:

{
  "plugin": "kmsg",
  "pluginConfig": {
    "revive": "true"
  },
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "bufferSize": 1000,
  "source": "kernel-monitor",
  "conditions": [
   ...
   ...
   ...
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "WatchLoopStarted",
      "pattern": "\\[npd-internal\\] Entering watch loop.*"
    },
    {
      "type": "temporary",
      "reason": "ParserRevived",
      "pattern": "\\[npd-internal\\] Reviving.*parser.*"
    },
   ...
   ...
   ...
  ]
}

we get Prometheus metrics (the exporter is enabled by default) that look like:

# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
   ...
   ...
   ...
problem_counter{reason="ParserRevived"} 1
   ...
   ...
   ...
problem_counter{reason="WatchLoopStarted"} 2
   ...
   ...
   ...
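
To keep these internal sends consistent, a small helper on the watcher could wrap the channel write. Below is a minimal sketch; the type name, field name, and import path are assumptions based on the snippets above, not the actual change:

```go
// Minimal sketch (illustrative names, not the actual NPD code): a helper on
// the kmsg watcher that pushes "[npd-internal]" health events through the
// normal log channel, so they get matched by kernel-monitor rules like any
// other kernel line.
package kmsg

import (
	"time"

	// Assumed import path for the Log type used by the system log monitor.
	logtypes "k8s.io/node-problem-detector/pkg/systemlogmonitor/types"
)

type kernelLogWatcher struct {
	logCh chan *logtypes.Log // consumed by the system log monitor
}

// internalEvent emits an internal health event with the "[npd-internal]" prefix.
func (k *kernelLogWatcher) internalEvent(msg string) {
	k.logCh <- &logtypes.Log{
		Message:   "[npd-internal] " + msg,
		Timestamp: time.Now(),
	}
}
```

With a helper like that, the two call sites above become one-liners: `k.internalEvent("Entering watch loop")` and `k.internalEvent("Reviving kmsg parser")`.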
@daveoy (Contributor, Author) commented Jan 9, 2025

Example PR attached.

@daveoy (Contributor, Author) commented Jan 10, 2025

#1009 is another example of how app health monitoring can be improved: it bubbles up logs from the underlying parser, which lets us determine the cause of a channel closure or of potential partial channel reads, as outlined in the downstream package.

We just add klog logging funcs to an internal logger that satisfies this interface: https://pkg.go.dev/github.com/euank/go-kmsg-parser@v2.0.0+incompatible/kmsgparser#Logger

so that we can take advantage of the log statements downstream: https://github.com/euank/go-kmsg-parser/blob/5ba4d492e455a77d25dcf0d2c4acc9f2afebef4e/kmsgparser/kmsgparser.go#L130-L143
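
For illustration, here is a minimal sketch of such a klog-backed logger; the adapter name is made up, and this is not necessarily how #1009 implements it:

```go
// Minimal sketch (assumption, not necessarily the #1009 implementation): an
// adapter that satisfies the kmsgparser.Logger interface (Infof, Warningf,
// Errorf) by delegating to klog, so the parser's own warnings and errors end
// up in NPD's logs.
package kmsg

import "k8s.io/klog/v2"

type klogLogger struct{}

func (klogLogger) Infof(format string, args ...interface{})    { klog.Infof(format, args...) }
func (klogLogger) Warningf(format string, args ...interface{}) { klog.Warningf(format, args...) }
func (klogLogger) Errorf(format string, args ...interface{})   { klog.Errorf(format, args...) }
```

The adapter can then be wired into the parser (via the Parser interface's SetLogger method) so that downstream messages, like the broken-pipe error shown in the next comment, are emitted through klog.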

@daveoy (Contributor, Author) commented Jan 10, 2025

Example log that appears in the application once #1009 is included:

logger.go:18] error reading /dev/kmsg: read /dev/kmsg: broken pipe
