Note: This CHANGELOG is only for the monitoring team to track all monitoring related changes. Please see OpenShift release notes for official changes.
- #2409 Remove prometheus-adapter code from CMO
- #2302 Enable feature
extra-scrape-metrics
in Prometheus user-workload - #2319 Allow read-only access to the Alertmanager API (use
monitoring-alertmanager-view
). - #2078 Support exporting VPA metrics from KSM.
- #2022 Add support to switch to metrics server from prometheus-adapter when the
MetricsServer
feature gate is enabled. - #2161 Add
PrometheusRestrictedConfig.RemoteWrite[].SendExemplars
. - #2184 Allow to query alerts of application namespaces as an application user from command line.
- #1937 Disables btrfs collector
- #1910 Add new web console usage metrics
- #1950 Disable CORS headers on Thanos querier by default and add a flag to enable them back.
- #1963 Add nodeExporter settings for network devices list.
- #2049 Remove Kube*QuotaOvercommit alerts.
- #2067 Add options to specify resource requests and limits for all components.
- #1785 Adds support for CollectionProfiles TechPreview
- #1830 Add alert KubePodNotScheduled
- #1843 Node Exporter ignores network interface under name "enP.*".
- #1860 Adds runbook for PrometheusRuleFailures
- #1868 In dashboards unstack diagrams with limit/quota/request.
- #1855 Add nodeExporter.collectors.cpufreq settings.
- #1882 Allow configuring secrets in alertmanager component (platform)
- #1876 Add nodeExporter.collectors.tcpstat settings.
- #1888 Add nodeExporter.collectors.netdev settings.
- #1884 Allow configuring secrets in alertmanager component (UWM)
- #1893 Add nodeExporter.collectors.netclass settings.
- #1894 Add toggle netlink implementation of netclass collector in Node Exporter.
- #1891 Add nodeExporter.collectors.buddyinfo settings.
- #1895 Add nodeExporter.maxProcs setting.
- #1624 Add option to specify TopologySpreadConstraints for Prometheus, Alertmanager, and ThanosRuler.
- #1752 Add option to improve consistency of prometheus-adapter CPU and RAM time series.
- #1803 Add alert TelemeterClientFailures
- #1836 PVC configuration link points to document specific to the cluster version
- #1652 Double scrape interval for all CMO controlled ServiceMonitors on single node deployments
- #1567 Enable validating webhook for AlertmanagerConfig custom resources
- #1557 Removing grafana from monitoring stack
- #1578 Add temporary cluster id label to remotely write relabel configs.
- #1350 Support label scrape limits in user-workload monitoring
- #1601 Expose the /federate endpoint of UWM Prometheus as a service
- #1617 Add Oauth2 setting to PrometheusK8s remoteWrite config
- #1598 Expose Authorization settings for remote write in the CMO configuration
- #1633 Expose the /federate endpoint of UWM Prometheus as a route
- #1638 Expose sigv4 setting to Prometheus remoteWrite
- #1579 Expose retention size settings for Platform Prometheus
- #1630 Expose retention size settings for UWM Prometheus
- #1640 Deploy standalone admission webhook for HA.
- #1651 Allow retention to be configurable for Thanos-Ruler in UWM
- #1467 Add bodysize limit for metric scraping
- #1661 Support deployment of dedicated Alertmanager for user-defined alerts.
- #1682 Support AlertmanagerConfig v1beta1.
- #1509 add NLB usage metrics for network edge
- #1299 Expose /api/v1/labels and /api/v1/labels/*/values endpoint on the Thanos query tenancy port.
- #1529 Expose /api/v1/series endpoint on the Thanos query tenancy port.
- #1402 Drop pod-centric cAdvisor metrics that are available at slice level.
- #1399 Rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus and make it resilient to WAL replays.
- #1446 Bump Grafana version to 7.5.11
- #1439 Expose PodDisruptionBudget labels from kube-state-metrics metrics.
- #1377 Allow OpenShift users to configure audit logs for prometheus-adapter
- #1481 Removing one of the AlertmanagerClusterFailedToSendAlerts alerts
- #1373 Enable admins to toggle the query_log_file setting for Prometheus.
- #1491 Rename alerts
AggregatedAPIErrors to KubeAggregatedAPIErrors
andAggregatedAPIDown to KubeAggregatedAPIDown
. - #1488 Removing the alert HighlyAvailableWorkloadIncorrectlySpread.
- #1858 Allow suppression of storage alerts via PersistentVolumeClaim label
- #1527 Enable user alerts via AlertManagerConfig to be forwarded to the existing Platform Alertmanager
- #1543 Bump Grafana version to v8.3.4
- #1545 Add ClusterRole to allow editing of AlertManagerConfig
- #1312 Support label to exclude namespaces from user-workload monitoring.
- #1308 Expose remote_write to user for in-cluster deployment and UWM.
- #1241 Add config option to disable Grafana deployment.
- #1278 Add EnforcedTargetLimit option for user-workload Prometheus.
- #1291 Drop high cardinality cAdvisor metrics via kube-prometheus #1250
- #1270 Show a message in the degraded condition when Platform Monitoring Prometheus runs without persistent storage.
- #1241 Allow configuring additional Alertmanagers in User Workload Prometheus and Thanos Ruler.
- #1293 Allow disabling the local Alertmanager.
- #1310 Update Alert Configs, fewer critical alerts with more accurate triggering condition.
- #1324 Allow filtering by job in 'Prometheus/Overview' dashboard.
- #1087 Decrease alert severity to "warning" for ThanosQueryHttpRequestQueryErrorRateHigh and ThanosQueryHttpRequestQueryRangeErrorRateHigh alerts.
- #1087 Increase "for" duration to 1 hour for all Thanos query alerts.
- #1087 Remove ThanosQueryInstantLatencyHigh and ThanosQueryRangeLatencyHigh alerts.
- #1090 Decrease alert severity to "warning" for all Thanos sidecar alerts.
- #1090 Increase "for" duration to 1 hour for all Thanos sidecar alerts.
- #1093 Bump kube-state-metrics to major new release v2.0.0-rc.1. This changes a lot of metrics and flags, see kube-state-metrics CHANGELOG for full changes.
- #1126 Remove deprecated techPreviewUserWorkload field from CMO's configmap.
- #1136 Add recording rule for builds by strategy
- #1210 Bump Grafana version to 7.5.5
- #963 bump mixins to include new etcd alerts
- Added etcdBackendQuotaLowSpace, etcdExcessiveDatabaseGrowth, and etcdHighFsyncDurations critical alert.
- Adjusted NodeClockNotSynchronising, NodeNetworkReceiveErrs, and NodeNetworkTransmitErrs alerts.
- #962 Enable namespace by pod and pod total networking Grafana dashboards.
- #959 Remove memory limits from prometheus-config-reloader in user workload monitoring
- #969 Bump Thanos v0.16.0
- #970 Bump prometheus-operator v0.43.0.
- #971 Enable
hwmon
in node-exporter for hardware sensor data collection - #983 Remove deprecated user workload configuration
- #995 Add logLevel config field to Thanos Query.
- #993 Add metrics + alerts for Thanos sidecars.
- #1013 #1018 Bump and pin jsonnet dependencies:
- prometheus-operator v0.44.1
- Thanos: v0.17.2
- kube-prometheus: release-0.7
- #936 Bump prometheus-operator 0.42.1
- #928 Bump prometheus-operator 0.42:
- #714 Validate new/updated PrometheusRule custom resources against the prometheus-operator rule validation API.
- #799 Rules federation support.
- #800 Collect metrics and implement alerting rules for Thanos querier.
- #804 Allow user workload monitoring configuration ConfigMap to be created in openshift-user-workload-monitoring namespace.
- #736 Expose /api/v1/rules endpoint of Thanos Querier via the 9093 TCP port with multi-tenancy support.
- #854 Change KubeQuotaExceeded to KubeQuotaFullyUsed.
- #859 Remove the
hostport
parameter from the configuration. - #859 Allow users to configure EnforcedSampleLimit for User workload monitoring Prometheus tenant.
- #894 Bump jsonnet dependencies:
- kubernetes-mixin: kubernetes-monitoring/kubernetes-mixin#475: alerts: adjust error message accordingly to recent change
- prometheus-operator: prometheus-operator/kube-prometheus#610: Add PrometheusOperatorListErrors and fix PrometheusOperatorWatchErrors threshold
- etcd: etcd-io/etcd#12122: Documentation/etcd-mixin: Reformulate alerting rules to use
without
rather thanby
- kubelet: prometheus-operator/kube-prometheus#623: Add scraping of endpoint for kubelet probe metrics
- thanos: thanos-io/thanos#2374: mixin: Added critical Rules alerts.
- #898 Bump jsonnet dependencies for kube-mixin:
- Adjusts severity levels of many alerts from critical to warning as they were cause based alerts
- Adjusts KubeStatefulSetUpdateNotRolledOut, KubeDaemonSetRolloutStuck
- Removes KubeAPILatencyHigh and KubeAPIErrorsHigh