Adds prometheus export point to the agent pod. (#2983)
* Add metrics system (prometheus) to the agent.

* Always have metrics, it just doesn't do anything.

* Add metrics for sniffer.

* Add metrics for stealer.

* How to install prometheus // Add more metrics to the reply

* Do not register the gauge multiple times.

* tcpoutgoing metrics

* udpoutgoing metrics

* docs

* move metrics to modules

* bump protocol

* fix axum-server version

* info -> trace

* just have metrics atomics everywhere

* drop actors

* gate metrics behind config

* use socketaddr

* spawn inside spawn

* error impl send

* move where we inc subs for sniffer

* no clone

* dec things closer to remove

* better help

Co-authored-by: Michał Smolarek <[email protected]>

* fix help

Co-authored-by: Michał Smolarek <[email protected]>

* no async file manager

* split filtered unfiltered steal inc

* remove unfiltered inc from wrong place

* cancellation token

* unit test for metrics

* remove unused

* rustfmt

* schema

* schema 2

* md

* appease clippy

* crypto provider for test

* encode_to_string // cancellation on error

* better error

Co-authored-by: Michał Smolarek <[email protected]>

* no more metricserror

* outdated docs

* connection metrics to run tasks

* unused

* remove todo

* line

Co-authored-by: Michał Smolarek <[email protected]>

* inc not dec

Co-authored-by: Michał Smolarek <[email protected]>

* inc stream fd

Co-authored-by: Michał Smolarek <[email protected]>

* only dec once

Co-authored-by: Michał Smolarek <[email protected]>

* drop filemanager updates metrics

* sniffer drop and update_packet_filter metrics

* remove dec from extra sub case

* move sub to PortSub add

* drop for PortSubs, zero subs counter

* new and drop for unfiltered task

* remove inc from run

* drop for filtered

* docs

* remove metrics from some places

* cancel

* near insert and remove

* connection sub inc

* connected clients

* dns request

* http in progress

* protocol

* http request in progress

* dns

* 2 new decs in tcp outgoing

* dec fd twice

* filtered port

* some docs

* improve error with display impl

* Make UdpOutgoing look more like TcpOutgoing

* AgentError everywhere

* allow type complexity

* Remove a few extra AgentResult

* bump protocol

* fix link doc

* fix wrong doc

* Put global gauges in state so tests don't explode.

* Fix repeated value.

* changelog

* Don't use default registry.

* no ignoring alreadyreg

* lil docs

* Remove example comments from readme.

---------

Co-authored-by: Michał Smolarek <[email protected]>
meowjesty and Razz4780 authored Jan 22, 2025
1 parent 15f9e34 commit 57bdab5
Showing 49 changed files with 1,856 additions and 668 deletions.
584 changes: 325 additions & 259 deletions Cargo.lock

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions changelog.d/2975.added.md
@@ -0,0 +1 @@
Add prometheus metrics to the mirrord-agent.
12 changes: 10 additions & 2 deletions mirrord-schema.json
@@ -1,7 +1,7 @@
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "LayerFileConfig",
  "description": "mirrord allows for a high degree of customization when it comes to which features you want to enable, and how they should function.\n\nAll of the configuration fields have a default value, so a minimal configuration would be no configuration at all.\n\nThe configuration supports templating using the [Tera](https://keats.github.io/tera/docs/) template engine. Currently we don't provide additional values to the context, if you have anything you want us to provide please let us know.\n\nTo use a configuration file in the CLI, use the `-f <CONFIG_PATH>` flag. Or if using VSCode Extension or JetBrains plugin, simply create a `.mirrord/mirrord.json` file or use the UI.\n\nTo help you get started, here are examples of a basic configuration file, and a complete configuration file containing all fields.\n\n### Basic `config.json` {#root-basic}\n\n```json { \"target\": \"pod/bear-pod\", \"feature\": { \"env\": true, \"fs\": \"read\", \"network\": true } } ```\n\n### Basic `config.json` with templating {#root-basic-templating}\n\n```json { \"target\": \"{{ get_env(name=\"TARGET\", default=\"pod/fallback\") }}\", \"feature\": { \"env\": true, \"fs\": \"read\", \"network\": true } } ```\n\n### Complete `config.json` {#root-complete}\n\nDon't use this example as a starting point, it's just here to show you all the available options.\n\n```json { \"accept_invalid_certificates\": false, \"skip_processes\": \"ide-debugger\", \"target\": { \"path\": \"pod/bear-pod\", \"namespace\": \"default\" }, \"connect_tcp\": null, \"agent\": { \"log_level\": \"info\", \"json_log\": false, \"labels\": { \"user\": \"meow\" }, \"annotations\": { \"cats.io/inject\": \"enabled\" }, \"namespace\": \"default\", \"image\": \"ghcr.io/metalbear-co/mirrord:latest\", \"image_pull_policy\": \"IfNotPresent\", \"image_pull_secrets\": [ { \"secret-key\": \"secret\" } ], \"ttl\": 30, \"ephemeral\": false, \"communication_timeout\": 30, \"startup_timeout\": 360, \"network_interface\": \"eth0\", \"flush_connections\": true }, \"feature\": { \"env\": { \"include\": \"DATABASE_USER;PUBLIC_ENV\", \"exclude\": \"DATABASE_PASSWORD;SECRET_ENV\", \"override\": { \"DATABASE_CONNECTION\": \"db://localhost:7777/my-db\", \"LOCAL_BEAR\": \"panda\" }, \"mapping\": { \".+_TIMEOUT\": \"1000\" } }, \"fs\": { \"mode\": \"write\", \"read_write\": \".+\\\\.json\" , \"read_only\": [ \".+\\\\.yaml\", \".+important-file\\\\.txt\" ], \"local\": [ \".+\\\\.js\", \".+\\\\.mjs\" ] }, \"network\": { \"incoming\": { \"mode\": \"steal\", \"http_filter\": { \"header_filter\": \"host: api\\\\..+\" }, \"port_mapping\": [[ 7777, 8888 ]], \"ignore_localhost\": false, \"ignore_ports\": [9999, 10000] }, \"outgoing\": { \"tcp\": true, \"udp\": true, \"filter\": { \"local\": [\"tcp://1.1.1.0/24:1337\", \"1.1.5.0/24\", \"google.com\", \":53\"] }, \"ignore_localhost\": false, \"unix_streams\": \"bear.+\" }, \"dns\": { \"enabled\": true, \"filter\": { \"local\": [\"1.1.1.0/24:1337\", \"1.1.5.0/24\", \"google.com\"] } } }, \"copy_target\": { \"scale_down\": false } }, \"operator\": true, \"kubeconfig\": \"~/.kube/config\", \"sip_binaries\": \"bash\", \"telemetry\": true, \"kube_context\": \"my-cluster\" } ```\n\n# Options {#root-options}",
  "description": "mirrord allows for a high degree of customization when it comes to which features you want to enable, and how they should function.\n\nAll of the configuration fields have a default value, so a minimal configuration would be no configuration at all.\n\nThe configuration supports templating using the [Tera](https://keats.github.io/tera/docs/) template engine. Currently we don't provide additional values to the context, if you have anything you want us to provide please let us know.\n\nTo use a configuration file in the CLI, use the `-f <CONFIG_PATH>` flag. Or if using VSCode Extension or JetBrains plugin, simply create a `.mirrord/mirrord.json` file or use the UI.\n\nTo help you get started, here are examples of a basic configuration file, and a complete configuration file containing all fields.\n\n### Basic `config.json` {#root-basic}\n\n```json { \"target\": \"pod/bear-pod\", \"feature\": { \"env\": true, \"fs\": \"read\", \"network\": true } } ```\n\n### Basic `config.json` with templating {#root-basic-templating}\n\n```json { \"target\": \"{{ get_env(name=\"TARGET\", default=\"pod/fallback\") }}\", \"feature\": { \"env\": true, \"fs\": \"read\", \"network\": true } } ```\n\n### Complete `config.json` {#root-complete}\n\nDon't use this example as a starting point, it's just here to show you all the available options.\n\n```json { \"accept_invalid_certificates\": false, \"skip_processes\": \"ide-debugger\", \"target\": { \"path\": \"pod/bear-pod\", \"namespace\": \"default\" }, \"connect_tcp\": null, \"agent\": { \"log_level\": \"info\", \"json_log\": false, \"labels\": { \"user\": \"meow\" }, \"annotations\": { \"cats.io/inject\": \"enabled\" }, \"namespace\": \"default\", \"image\": \"ghcr.io/metalbear-co/mirrord:latest\", \"image_pull_policy\": \"IfNotPresent\", \"image_pull_secrets\": [ { \"secret-key\": \"secret\" } ], \"ttl\": 30, \"ephemeral\": false, \"communication_timeout\": 30, \"startup_timeout\": 360, \"network_interface\": \"eth0\", \"flush_connections\": true, \"metrics\": \"0.0.0.0:9000\" }, \"feature\": { \"env\": { \"include\": \"DATABASE_USER;PUBLIC_ENV\", \"exclude\": \"DATABASE_PASSWORD;SECRET_ENV\", \"override\": { \"DATABASE_CONNECTION\": \"db://localhost:7777/my-db\", \"LOCAL_BEAR\": \"panda\" }, \"mapping\": { \".+_TIMEOUT\": \"1000\" } }, \"fs\": { \"mode\": \"write\", \"read_write\": \".+\\\\.json\" , \"read_only\": [ \".+\\\\.yaml\", \".+important-file\\\\.txt\" ], \"local\": [ \".+\\\\.js\", \".+\\\\.mjs\" ] }, \"network\": { \"incoming\": { \"mode\": \"steal\", \"http_filter\": { \"header_filter\": \"host: api\\\\..+\" }, \"port_mapping\": [[ 7777, 8888 ]], \"ignore_localhost\": false, \"ignore_ports\": [9999, 10000] }, \"outgoing\": { \"tcp\": true, \"udp\": true, \"filter\": { \"local\": [\"tcp://1.1.1.0/24:1337\", \"1.1.5.0/24\", \"google.com\", \":53\"] }, \"ignore_localhost\": false, \"unix_streams\": \"bear.+\" }, \"dns\": { \"enabled\": true, \"filter\": { \"local\": [\"1.1.1.0/24:1337\", \"1.1.5.0/24\", \"google.com\"] } } }, \"copy_target\": { \"scale_down\": false } }, \"operator\": true, \"kubeconfig\": \"~/.kube/config\", \"sip_binaries\": \"bash\", \"telemetry\": true, \"kube_context\": \"my-cluster\" } ```\n\n# Options {#root-options}",
  "type": "object",
  "properties": {
    "accept_invalid_certificates": {
@@ -255,7 +255,7 @@
    "properties": {
      "annotations": {
        "title": "agent.annotations {#agent-annotations}",
        "description": "Allows setting up custom annotations for the agent Job and Pod.\n\n```json { \"annotations\": { \"cats.io/inject\": \"enabled\" } } ```",
        "description": "Allows setting up custom annotations for the agent Job and Pod.\n\n```json { \"annotations\": { \"cats.io/inject\": \"enabled\", \"prometheus.io/scrape\": \"true\", \"prometheus.io/port\": \"9000\" } } ```",
        "type": [
          "object",
          "null"
        ]
@@ -378,6 +378,14 @@
          "null"
        ]
      },
      "metrics": {
        "title": "agent.metrics {#agent-metrics}",
        "description": "Enables prometheus metrics for the agent pod.\n\nYou might need to add annotations to the agent pod depending on how prometheus is configured to scrape for metrics.\n\n```json { \"metrics\": \"0.0.0.0:9000\" } ```",
        "type": [
          "string",
          "null"
        ]
      },
      "namespace": {
        "title": "agent.namespace {#agent-namespace}",
        "description": "Namespace where the agent shall live. Note: Doesn't work with ephemeral containers. Defaults to the current kubernetes namespace.",
3 changes: 3 additions & 0 deletions mirrord/agent/Cargo.toml
@@ -69,6 +69,8 @@ x509-parser = "0.16"
rustls.workspace = true
envy = "0.4"
socket2.workspace = true
prometheus = { version = "0.13", features = ["process"] }
axum = { version = "0.7", features = ["macros"] }
iptables = { git = "https://github.com/metalbear-co/rust-iptables.git", rev = "e66c7332e361df3c61a194f08eefe3f40763d624" }
rawsocket = { git = "https://github.com/metalbear-co/rawsocket.git" }
procfs = "0.17.0"
@@ -78,3 +80,4 @@ rstest.workspace = true
mockall = "0.13"
test_bin = "0.4"
rcgen.workspace = true
reqwest.workspace = true
195 changes: 195 additions & 0 deletions mirrord/agent/README.md
@@ -6,3 +6,198 @@ Agent part of [mirrord](https://github.com/metalbear-co/mirrord) responsible for
mirrord-agent is written in Rust for safety, low memory consumption and performance.

mirrord-agent is distributed as a container image (currently only x86) that is published on [GitHub Packages publicly](https://github.com/metalbear-co/mirrord-agent/pkgs/container/mirrord-agent).

## Enabling Prometheus metrics

To start the metrics server, you'll need to add this config to your `mirrord.json`:

```json
{
  "agent": {
    "metrics": "0.0.0.0:9000",
    "annotations": {
      "prometheus.io/scrape": "true",
      "prometheus.io/port": "9000"
    }
  }
}
```

If you change the port, remember to change it in both `metrics` and `annotations`; they
have to match, otherwise Prometheus will try to scrape on `port: 80` or another commonly
used port.
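The relationship between the two values can be checked mechanically; a minimal std-only Rust sketch (not agent code, the values are from the example above):

```rust
use std::net::SocketAddr;

fn main() {
    // The `metrics` field carries a full socket address...
    let metrics: SocketAddr = "0.0.0.0:9000".parse().expect("valid socket address");

    // ...while the annotation only carries the port Prometheus should scrape.
    let annotation_port: u16 = "9000".parse().expect("valid port");

    // They have to agree, or Prometheus scrapes the wrong port.
    assert_eq!(metrics.port(), annotation_port);
    println!("scrape port: {}", metrics.port());
}
```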

### Installing Prometheus

Run `kubectl apply -f {file-name}.yaml` on each of the following `yaml` files, in order,
and you should get Prometheus running in your cluster. You can access the dashboard from
your browser at `http://{cluster-ip}:30909`; if you're using minikube it might be
`http://192.168.49.2:30909`.

Prometheus will run under the `monitoring` namespace, but it'll be able to look into
resources from all namespaces. The config in `configmap.yaml` sets Prometheus to look
at pods only; if you want it to scrape other resources, check
[this example](https://github.com/prometheus/prometheus/blob/main/documentation/examples/prometheus-kubernetes.yml).

1. `create-namespace.yaml`

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
```

2. `cluster-role.yaml`

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
```

3. `service-account.yaml`

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
```

4. `cluster-role-binding.yaml`

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
```

5. `configmap.yaml`

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      keep_dropped_targets: 100
    scrape_configs:
      - job_name: "kubernetes-pods"
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
```

- If you make any changes to `configmap.yaml`, remember to `kubectl apply` it
**before** restarting the `prometheus` deployment.

6. `deployment.yaml`

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: prom/prometheus
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
          ports:
            - name: web
              containerPort: 9090
          volumeMounts:
            - name: prometheus-config-volume
              mountPath: /etc/prometheus
      restartPolicy: Always
      volumes:
        - name: prometheus-config-volume
          configMap:
            defaultMode: 420
            name: prometheus-config
```

7. `service.yaml`

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9090'
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
    - port: 8080
      targetPort: 9090
      nodePort: 30909
```
9 changes: 8 additions & 1 deletion mirrord/agent/src/cli.rs
@@ -1,8 +1,11 @@
#![deny(missing_docs)]

use std::net::SocketAddr;

use clap::{Parser, Subcommand};
use mirrord_protocol::{
    MeshVendor, AGENT_IPV6_ENV, AGENT_NETWORK_INTERFACE_ENV, AGENT_OPERATOR_CERT_ENV,
    MeshVendor, AGENT_IPV6_ENV, AGENT_METRICS_ENV, AGENT_NETWORK_INTERFACE_ENV,
    AGENT_OPERATOR_CERT_ENV,
};

const DEFAULT_RUNTIME: &str = "containerd";
@@ -28,6 +31,10 @@ pub struct Args {
    #[arg(short = 'i', long, env = AGENT_NETWORK_INTERFACE_ENV)]
    pub network_interface: Option<String>,

    /// Controls whether metrics are enabled, and the address to set up the metrics server.
    #[arg(long, env = AGENT_METRICS_ENV)]
    pub metrics: Option<SocketAddr>,

    /// Return an error after accepting the first client connection, in order to test agent error
    /// cleanup.
    ///
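Since `metrics` is an `Option<SocketAddr>`, the agent only starts a metrics server when a valid address is supplied; a minimal std-only sketch of that optional-address behavior (not agent code, names are illustrative):

```rust
use std::net::SocketAddr;

// Mimics the behavior of an optional `--metrics` value: when the value is
// absent the metrics server is simply never started, and an unparsable
// value yields no address here (clap itself would report an error instead).
fn metrics_addr(value: Option<&str>) -> Option<SocketAddr> {
    value?.parse().ok()
}

fn main() {
    // No value configured: metrics stay disabled.
    assert_eq!(metrics_addr(None), None);

    // A full socket address enables the server.
    let addr = metrics_addr(Some("0.0.0.0:9000")).expect("should parse");
    assert_eq!(addr.port(), 9000);
    println!("metrics server would bind {addr}");
}
```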
18 changes: 17 additions & 1 deletion mirrord/agent/src/client_connection.rs
@@ -208,7 +208,7 @@ enum ConnectionFramed {

#[cfg(test)]
mod test {
    use std::sync::Arc;
    use std::sync::{Arc, Once};

    use futures::StreamExt;
    use mirrord_protocol::ClientCodec;
@@ -220,10 +220,19 @@ mod test {

    use super::*;

    static CRYPTO_PROVIDER: Once = Once::new();

    /// Verifies that [`AgentTlsConnector`] correctly accepts a
    /// connection from a server using the provided certificate.
    #[tokio::test]
    async fn agent_tls_connector_valid_cert() {
        CRYPTO_PROVIDER.call_once(|| {
            rustls::crypto::CryptoProvider::install_default(
                rustls::crypto::aws_lc_rs::default_provider(),
            )
            .expect("Failed to install crypto provider")
        });

        let cert = rcgen::generate_simple_self_signed(vec!["operator".to_string()]).unwrap();
        let cert_bytes = cert.cert.der();
        let key_bytes = cert.key_pair.serialize_der();
@@ -269,6 +278,13 @@
    /// connection from a server using some other certificate.
    #[tokio::test]
    async fn agent_tls_connector_invalid_cert() {
        CRYPTO_PROVIDER.call_once(|| {
            rustls::crypto::CryptoProvider::install_default(
                rustls::crypto::aws_lc_rs::default_provider(),
            )
            .expect("Failed to install crypto provider")
        });

        let server_cert = rcgen::generate_simple_self_signed(vec!["operator".to_string()]).unwrap();
        let cert_bytes = server_cert.cert.der();
        let key_bytes = server_cert.key_pair.serialize_der();
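The `Once` guard above exists because `CryptoProvider::install_default` is process-global and both tests may run in the same process; the idempotence it relies on can be sketched with std alone (names are illustrative):

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Once,
};

static INSTALL: Once = Once::new();
static INSTALL_COUNT: AtomicUsize = AtomicUsize::new(0);

// Stand-in for a process-global installer that must run at most once,
// like `CryptoProvider::install_default` in the tests above.
fn ensure_installed() {
    INSTALL.call_once(|| {
        INSTALL_COUNT.fetch_add(1, Ordering::Relaxed);
    });
}

fn main() {
    // Every test can call this freely; only the first call installs.
    ensure_installed();
    ensure_installed();
    assert_eq!(INSTALL_COUNT.load(Ordering::Relaxed), 1);
    println!("installed exactly once");
}
```

Without the guard, the second test to run would hit the "already installed" error path instead of reusing the provider.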
4 changes: 2 additions & 2 deletions mirrord/agent/src/container_handle.rs
@@ -1,7 +1,7 @@
use std::{collections::HashMap, sync::Arc};

use crate::{
    error::Result,
    error::AgentResult,
    runtime::{Container, ContainerInfo, ContainerRuntime},
};

@@ -22,7 +22,7 @@ pub(crate) struct ContainerHandle(Arc<Inner>);
impl ContainerHandle {
    /// Retrieve info about the container and initialize this struct.
    #[tracing::instrument(level = "trace")]
    pub(crate) async fn new(container: Container) -> Result<Self> {
    pub(crate) async fn new(container: Container) -> AgentResult<Self> {
        let ContainerInfo { pid, env: raw_env } = container.get_info().await?;

        let inner = Inner { pid, raw_env };
let inner = Inner { pid, raw_env };
18 changes: 8 additions & 10 deletions mirrord/agent/src/dns.rs
@@ -16,10 +16,7 @@ use tokio::{
use tokio_util::sync::CancellationToken;
use tracing::Level;

use crate::{
    error::{AgentError, Result},
    watched_task::TaskStatus,
};
use crate::{error::AgentResult, metrics::DNS_REQUEST_COUNT, watched_task::TaskStatus};

#[derive(Debug)]
pub(crate) enum ClientGetAddrInfoRequest {
@@ -167,6 +164,9 @@ impl DnsWorker {
        let etc_path = self.etc_path.clone();
        let timeout = self.timeout;
        let attempts = self.attempts;

        DNS_REQUEST_COUNT.fetch_add(1, std::sync::atomic::Ordering::Relaxed);

        let support_ipv6 = self.support_ipv6;
        let lookup_future = async move {
            let result = Self::do_lookup(
@@ -181,15 +181,13 @@
            if let Err(result) = message.response_tx.send(result) {
                tracing::error!(?result, "Failed to send query response");
            }
            DNS_REQUEST_COUNT.fetch_sub(1, std::sync::atomic::Ordering::Relaxed);
        };

        tokio::spawn(lookup_future);
    }

    pub(crate) async fn run(
        mut self,
        cancellation_token: CancellationToken,
    ) -> Result<(), AgentError> {
    pub(crate) async fn run(mut self, cancellation_token: CancellationToken) -> AgentResult<()> {
        loop {
            tokio::select! {
                _ = cancellation_token.cancelled() => break Ok(()),
@@ -225,7 +223,7 @@ impl DnsApi {
    pub(crate) async fn make_request(
        &mut self,
        request: ClientGetAddrInfoRequest,
    ) -> Result<(), AgentError> {
    ) -> AgentResult<()> {
        let (response_tx, response_rx) = oneshot::channel();

        let command = DnsCommand {
@@ -244,7 +242,7 @@
    /// Returns the result of the oldest outstanding DNS request issued with this struct (see
    /// [`Self::make_request`]).
    #[tracing::instrument(level = Level::TRACE, skip(self), ret, err)]
    pub(crate) async fn recv(&mut self) -> Result<GetAddrInfoResponse, AgentError> {
    pub(crate) async fn recv(&mut self) -> AgentResult<GetAddrInfoResponse> {
        let Some(response) = self.responses.next().await else {
            return future::pending().await;
        };
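The `DNS_REQUEST_COUNT` gauge above is incremented when a lookup is dispatched and decremented once the response is sent, so it tracks in-flight requests rather than a running total; a minimal std-only sketch of that gauge pattern (synchronous, names are illustrative):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// In-flight request gauge, in the spirit of the agent's `DNS_REQUEST_COUNT`.
static IN_FLIGHT: AtomicI64 = AtomicI64::new(0);

fn handle_request(work: impl FnOnce()) {
    IN_FLIGHT.fetch_add(1, Ordering::Relaxed); // request dispatched
    work();
    IN_FLIGHT.fetch_sub(1, Ordering::Relaxed); // response sent
}

fn main() {
    handle_request(|| {
        // While the lookup runs, a scrape would observe one in-flight request.
        assert_eq!(IN_FLIGHT.load(Ordering::Relaxed), 1);
    });

    // Once the response is sent the gauge returns to zero.
    assert_eq!(IN_FLIGHT.load(Ordering::Relaxed), 0);
    println!("gauge balanced");
}
```

The important property is that every increment is paired with exactly one decrement, even on the error path; several of the commit messages above ("only dec once", "inc not dec") are fixes to exactly that pairing.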
