runtime-metrics implements one-dimensional numeric metrics, a server to publish these, and a client to scrape them.

KarelKubat/runtime-metrics

runtimemetrics

-- import "github.com/KarelKubat/runtime-metrics"

Package runtimemetrics implements the collection of in-program metrics, a server to publish these, and a client to collect them.

What Is It

The design principle of runtime-metrics is that it's light-weight enough to be embedded into a server, and that it won't hinder performance. It can collect and report simple, one-dimensional metrics: counters, sums and averages - either over the lifetime of the metric, or over a given period (e.g., counts per minute).

It cannot handle multi-dimensional data, such as lists of numbers or strings. It doesn't retain series; it's designed so that the impact of a metric is known a priori. If you need to retain lists of counters, sums or averages (e.g., to analyze trends), then scrape them with an external client and retain them there.

Basic In-Program Usage

To collect metrics inside a program and act on changes, without publishing them through a server or scraping them with a client, the following can be used. As an example, we wish to track the ratio of failures, and log a warning when this ratio is 1% or more over a period of 30 seconds.

import (
  "log"
  "time"

  "github.com/KarelKubat/runtime-metrics/base"
)
...

// Create the metrics. We have:
// - NewCount() for incremental counting
// - NewSum() for totalling float64 values
// - NewAverage() for averaging float64 values
// and metrics per given period:
// - NewCountPerDuration(d time.Duration)
// - NewSumPerDuration(d time.Duration)
// - NewAveragePerDuration(d time.Duration)

errorRatio := base.NewAveragePerDuration(30 * time.Second)

// Check failures vs. totals and do something when there are >= 1% failures.
// Poll every 30 seconds; a shorter period won't help because the average
// cannot change any quicker.
go func() {
  for {
    // average is the recorded value, n is the number of cases,
    // until is the up-to timestamp of the calculation
    average, n, until := errorRatio.Report()
    if average >= 0.01 {
      log.Printf("WARNING %v percent of lookups is failing "+
        "over a period of 30 seconds until %v, %v cases",
        average*100.0, until, n)
    }
    time.Sleep(30 * time.Second)
  }
}()

// Loop and track totals and failures
for {
  err := lookupSomething()  // hypothetical function
  if err != nil {
    errorRatio.Mark(1.0)    // mark error (and increase #-cases)
  } else {
    errorRatio.Mark(0.0)    // mark success (only increase #-cases)
  }
}

It should be noted here that there are different ways to solve this. One could also use two counters, one for the total loops and one for the failures, and divide them to get a ratio.
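The two-counter alternative could look roughly like the sketch below. To keep it self-contained it does not use the package's own count metrics; `ratioTracker` and its methods are hypothetical names invented for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// ratioTracker is a hypothetical helper, not part of runtime-metrics.
// It keeps two counters and derives the failure ratio on demand.
type ratioTracker struct {
	mu       sync.Mutex
	total    int64
	failures int64
}

// Mark records one attempt; failed says whether it ended in an error.
func (r *ratioTracker) Mark(failed bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.total++
	if failed {
		r.failures++
	}
}

// Ratio returns failures divided by total, or 0 when nothing was recorded.
func (r *ratioTracker) Ratio() float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.total == 0 {
		return 0
	}
	return float64(r.failures) / float64(r.total)
}

func main() {
	var r ratioTracker
	for i := 0; i < 200; i++ {
		r.Mark(i%100 == 0) // 2 failures in 200 attempts
	}
	fmt.Printf("failure ratio: %.2f%%\n", r.Ratio()*100)
}
```

Marking an average with 1.0 or 0.0, as in the example above, yields the same ratio with a single metric; two counters are mostly useful when the raw totals are of interest as well.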

In this case, it's also good to limit the collection of metrics and their reporting to a given duration; otherwise, a long run of successes might mask suddenly occurring errors until it's too late.

The metric types all have a somewhat similar API: New*() instantiates a metric, Mark() registers an event, and Report() returns some result. In the case of an average, Mark() expects one float64 argument, and Report() returns three values: average, number of cases, and a timestamp.
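To make the New*() / Mark() / Report() shape concrete, here is a minimal, self-contained sketch of a lifetime average. It is a stand-in written for this illustration, not the implementation in the base package; the type name, the int64 case count, and the use of a mutex are all assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// average is a hypothetical stand-in for a lifetime average metric.
// It is NOT the type exported by the base package.
type average struct {
	mu  sync.Mutex
	sum float64
	n   int64
}

// newAverage instantiates the metric (the New*() step).
func newAverage() *average { return &average{} }

// Mark registers one observation.
func (a *average) Mark(v float64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.sum += v
	a.n++
}

// Report returns the average so far, the number of cases, and the
// time of the calculation. With no cases, the average is 0.
func (a *average) Report() (float64, int64, time.Time) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.n == 0 {
		return 0, 0, time.Now()
	}
	return a.sum / float64(a.n), a.n, time.Now()
}

func main() {
	m := newAverage()
	for _, v := range []float64{1, 0, 0, 0} {
		m.Mark(v)
	}
	avg, n, until := m.Report()
	fmt.Printf("average=%v over %v cases until %v\n", avg, n, until.Format(time.RFC3339))
}
```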

Publishing Metrics

In order to publish metrics, they are added to a registry, and a server is started:

import "github.com/KarelKubat/runtime-metrics/base"
import "github.com/KarelKubat/runtime-metrics/registry"
import "github.com/KarelKubat/runtime-metrics/reporter"
...

errorRatio := base.NewAveragePerDuration(30 * time.Second)
err := registry.AddAveragePerDuration("lookup-error-ratio-per-30-sec", errorRatio)
if err != nil { ... }                 // collision of name

go func() {
  err := reporter.StartReporter(":1234")
  if err != nil { ... }               // probably port 1234 is already in use
}()

Scraping Metrics

Published metrics can be scraped by a client:

import "github.com/KarelKubat/runtime-metrics/reporter"

c, err := reporter.NewClient(":1234")
if err != nil { ... }  // connection error

av, n, until, err := c.AveragePerDuration("lookup-error-ratio-per-30-sec")
if err != nil { ... }  // metric doesn't exist

if av > 0.01 {
  log.Printf("WARNING %v percent of lookups is failing", av * 100)
}

There is also discovery: a client can get a list of all the names of counts, sums, and averages, and query these. See demo/demosrc/client_allnames.go for an example.

Further Reading

Be sure to read the docs of the base/ package to understand what the metrics do. In particular, make sure you understand how *PerDuration metrics work: they always report results that "lag" by a duration.

For information on how to publish metrics, read the server section in package reporter/. For scraping, read the client section. A runnable example is provided in demo/ with sources in demo/demosrc/.

Packages tools/, namedset/ and api/ are for internal usage.

License

This software is distributed under GPLv3, see LICENSE.md. Basically that means that you can use it as you like, there's no cost and no guarantee, and you are free to modify the code as you see fit; but if you do, you must make your changes available to others under the same conditions.

As a courtesy, I'd love to be informed of any changes that you make - I could incorporate them into a next release, with full credits if you like.
