- ACM Observability Monitoring Grafana Dashboard
- Multi Cluster Logging with Loki Operator
- OpenShift Data Foundations Ceph Storage Percent Used in OpenShift Monitoring
- Design: Logging System
- Design: Observability Architecture
- Monitoring in the cluster is provided by the Red Hat Advanced Cluster Management Operator here.
- The ACM MultiClusterObservability component allows us to configure the storage class, store storage size, rule storage size, receive storage size, compact storage size, alert manager storage size, metric object storage bucket, metrics collection interval, and downsampling for observability. It also allows us to configure the replicas and node selectors for each of the observability components (store, receive, grafana, query, alert manager, store memcached, RBAC query proxy, observatorium API, query frontend, rule, and query frontend memcached). See the Multi Cluster Observability component here.
- You can find the observability components in the open-cluster-management-observability namespace here.
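To make the configuration options above more concrete, here is a minimal sketch that builds a MultiClusterObservability manifest as a Python dictionary and prints it as JSON (which `oc apply -f -` also accepts). The field names follow our reading of the `v1beta2` API and should be checked against the CRD installed in the cluster. All sizes, the interval, the replica counts, the resource name `observability`, and the secret key `thanos.yaml` are illustrative assumptions, not the values used in this deployment.

```python
import json


def build_mco_manifest(storage_class: str, bucket_secret: str) -> dict:
    """Return a MultiClusterObservability manifest as a plain dict.

    All sizes, the interval, and the replica counts below are placeholder
    assumptions for illustration; adjust them to the real cluster.
    """
    return {
        "apiVersion": "observability.open-cluster-management.io/v1beta2",
        "kind": "MultiClusterObservability",
        "metadata": {"name": "observability"},  # assumed resource name
        "spec": {
            "enableDownsampling": True,
            "observabilityAddonSpec": {
                "enableMetrics": True,
                "interval": 300,  # seconds between metric pushes (assumed)
            },
            "storageConfig": {
                "storageClass": storage_class,
                # Secret holding the object storage bucket configuration.
                "metricObjectStorage": {"name": bucket_secret, "key": "thanos.yaml"},
                "storeStorageSize": "10Gi",
                "ruleStorageSize": "1Gi",
                "receiveStorageSize": "100Gi",
                "compactStorageSize": "100Gi",
                "alertmanagerStorageSize": "1Gi",
            },
            # Per-component overrides; only two components are shown here.
            "advanced": {
                "grafana": {"replicas": 2},
                "query": {"replicas": 2},
            },
        },
    }


if __name__ == "__main__":
    # Both arguments are hypothetical example values.
    manifest = build_mco_manifest("example-storage-class", "thanos-object-storage")
    print(json.dumps(manifest, indent=2))
```

Keeping the manifest in a function makes it easy to generate one variant per environment while reviewing the storage sizes in a single place.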
If you encounter errors like the following during ACM upgrades with Observability enabled, the cause may be that the multicluster-engine is out of sync with old ACM data:
E1212 15:28:42.757827 1 helmreleasemgr.go:99] failed to download chart from helm repo. - url: http://multiclusterhub-repo.open-cluster-management.svc.cluster.local:3000/charts/policyreport-2.5.3.tgz error: return code: 404 unable to retrieve chart - Failed to download the chart
error validating existing CRs against new CRD's schema for "multiclusterobservabilities.observability.open-cluster-management.io": error listing resources in GroupVersionResource schema.GroupVersionResource{Group:"observability.open-cluster-management.io", Version:"v1beta1", Resource:"multiclusterobservabilities"}: conversion webhook for observability.open-cluster-management.io/v1beta2, Kind=MultiClusterObservability failed: Post "https://multicluster-observability-webhook-service.open-cluster-management.svc:443/convert?timeout=30s": no endpoints available for service "multicluster-observability-webhook-service"
Recovering requires deleting the multicluster-engine Subscription and CSV, deleting the openshift-monitoring pods, deleting the ACM Subscription, CSV, and MultiClusterObservability CRD, and then putting it all back again.
Monitoring and logging for the infrastructure hardware and software that is not OpenShift (for example Grafana).
As a NERC administrator, I should be able to monitor the status of any infrastructure software or hardware that supports operations for the NERC OpenShift environment, even if it is not itself part of OpenShift.
You can access many metrics for pods of applications in a namespace. See some of the available logs and metrics:
- Click here to visit the CPU usage logs for dex.
- Click here to visit the CPU usage logs for gitops.
- Click here to visit the CPU usage logs for grafana.
- Click here to visit the CPU usage logs for logging.
- Click here to visit the CPU usage logs for loki.
- Click here to visit the CPU usage logs for vault.
- Click here to visit the CPU usage logs for xdmod.
As an administrator of the cluster, I should be able to view daily, weekly, and monthly reports of the cluster infrastructure utilization.
- Administrator logs into the associated XDMoD instance and views reports.
- Click here to view the ACM Observability Grafana dashboards. These dashboards provide insights into Control Plane Health, Optimization, Capacity, Utilization, and more. You can change the timespan in the top right to show results in terms of minutes, hours, days, months or years.
As a user and the owner of a project, I should be able to view daily, weekly, and monthly reports of the infrastructure utilization by the projects I own.
- User logs into the associated XDMoD instance and views reports for projects they own.
- Users should not be able to view reports for projects they do not own. We still need to look into restricting the view to only the projects that they own.
- Click here to view the memory usage of projects over time.
- Click here to view the CPU usage of the projects over time.
- Click here to show the top 5 projects by CPU usage at each point in time.
Log archiving and rollover could run the Ceph storage out of space. The metrics needed to calculate space on the Ceph cluster are not yet sent to Observability, so they are available in OpenShift Monitoring instead. Check the log storage space consumed vs. available using these OpenShift metrics:
Here are some useful links to the MultiClusterObservability documentation:
- APIs Red Hat Advanced Cluster Management for Kubernetes 2.5
- Observing environments introduction Red Hat Advanced Cluster Management for Kubernetes 2.5
- Managing applications Red Hat Advanced Cluster Management for Kubernetes 2.0
- Logging in the cluster is provided by Red Hat OpenShift Logging here.
- We combine the OpenShift Logging Operator with the Loki Operator here, so that the Logging Operator sends the infrastructure, audit, and application logs to the Loki Operator where they are stored in an Object Bucket.
- The OpenShift Logging Operator has a dependency on the Elasticsearch Operator here. Whether you store logs in Elasticsearch or in Loki, you still need the Elasticsearch Operator installed for the CustomResourceDefinitions it provides.
- The Loki Operator allows you to set up LokiStacks, AlertingRules, RecordingRules, and RulerConfigs based on your cluster logs for infrastructure, audit, and applications. See the Loki Operator here
- Setting up a LokiStack allows you to configure the size of a cluster logging system that you desire in terms of storage and replicas. LokiStack here
- Setting up a LokiStack involves configuring persistent storage by storageClassName for Persistent Volume Claims. ocs-external-storagecluster-ceph-rbd storage class here
- Setting up a LokiStack involves configuring object storage by a secret named "thanos-object-storage" in the "openshift-logging" namespace containing the access_key_id, access_key_secret, bucketnames, and endpoint of the object storage.
- The object storage for Loki is provided by OpenShift Data Foundations. See the openshift-logging-objectbucketclaim Object Bucket Claim here
- The infra and prod cluster logs are available on the infra cluster here
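The LokiStack settings listed above can be sketched as two manifests built in Python: the object storage Secret with the four keys named above, and a LokiStack that references it. This is a minimal sketch under stated assumptions: the LokiStack name `logging-loki`, the size `1x.small`, the API version `loki.grafana.com/v1`, the storage type `s3`, and the tenant mode are illustrative and should be checked against the installed Loki Operator; the credential values are placeholders.

```python
import json

NAMESPACE = "openshift-logging"
SECRET_NAME = "thanos-object-storage"


def build_loki_secret(access_key_id: str, access_key_secret: str,
                      bucketnames: str, endpoint: str) -> dict:
    """Secret carrying the object storage settings that the LokiStack reads."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": SECRET_NAME, "namespace": NAMESPACE},
        "stringData": {
            "access_key_id": access_key_id,
            "access_key_secret": access_key_secret,
            "bucketnames": bucketnames,
            "endpoint": endpoint,
        },
    }


def build_lokistack(size: str, storage_class: str) -> dict:
    """LokiStack sized by `size`, with PVCs provisioned from `storage_class`."""
    return {
        "apiVersion": "loki.grafana.com/v1",  # assumed; older operators use v1beta1
        "kind": "LokiStack",
        "metadata": {"name": "logging-loki", "namespace": NAMESPACE},  # assumed name
        "spec": {
            "size": size,
            "storageClassName": storage_class,
            "storage": {"secret": {"name": SECRET_NAME, "type": "s3"}},
            "tenants": {"mode": "openshift-logging"},
        },
    }


if __name__ == "__main__":
    # Placeholder credentials; real values come from the Object Bucket Claim.
    secret = build_loki_secret("EXAMPLE_KEY_ID", "EXAMPLE_KEY_SECRET",
                               "example-bucket", "https://s3.example.invalid")
    stack = build_lokistack("1x.small", "ocs-external-storagecluster-ceph-rbd")
    print(json.dumps([secret, stack], indent=2))
```

Generating both objects from the same constants keeps the secret name in the LokiStack in step with the Secret itself, which is an easy thing to get wrong when editing the two by hand.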
As an administrator of the cluster, I should be able to track all the events in the cluster using the logging system in OpenShift.
- Click here to visit the Logs.
- You can easily filter by recent date, or date range in the past.
- You can easily filter by content, namespaces, pods, and containers.
- You can also filter by log levels: critical, error, warning, info, debug, trace, unknown.
- Click "Show Query" to add more advanced filters like cluster ID:
- Here are the logs for the infra cluster. You can also add the following filter to the end of your log query to show only infra cluster logs:
| openshift_cluster_id="b3c6e302-f119-4adb-bc48-e04c6aa2eaa5"
- Here are the logs for the prod cluster. You can also add the following filter to the end of your log query to show only prod cluster logs:
| openshift_cluster_id="fcb727d6-3e61-4d23-913d-756cf41c7982"
- NERC Admins have access to application logs.
- Infrastructure and audit logs have always been reserved for cluster admins in OpenShift Logging (even on the old stack with Elasticsearch). LokiStack is best configured for admin access via a group. Three dedicated group names are currently supported: cluster-admin, dedicated-admin, and the standard group for kubeadmin. These groups require a ClusterRoleBinding to the cluster-admin ClusterRole.
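The per-cluster filters shown in the list above can also be appended programmatically, for example when generating saved queries. The sketch below is a small helper under stated assumptions: the base query is a hypothetical example, and only the two cluster IDs come from this page.

```python
# Cluster IDs as listed above.
INFRA_CLUSTER_ID = "b3c6e302-f119-4adb-bc48-e04c6aa2eaa5"
PROD_CLUSTER_ID = "fcb727d6-3e61-4d23-913d-756cf41c7982"


def with_cluster_filter(query: str, cluster_id: str) -> str:
    """Append an openshift_cluster_id label filter to the end of a log query."""
    return f'{query.rstrip()} | openshift_cluster_id="{cluster_id}"'


if __name__ == "__main__":
    # Hypothetical base query; use whatever "Show Query" displays in the console.
    base_query = '{ log_type="application" } | json'
    print(with_cluster_filter(base_query, INFRA_CLUSTER_ID))
    print(with_cluster_filter(base_query, PROD_CLUSTER_ID))
```

The helper only concatenates strings, so the base query must already parse the log line (for example with `| json`) for the label filter to match.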
Here are some useful links to the logging documentation:
- Chapter 7. Forwarding logs to external third-party logging systems OpenShift Container Platform 4.10
- Logging OpenShift Container Platform 4.10
- Exported fields | Logging | OpenShift Container Platform 4.10
- Deploying Cluster Logging
- Multi-tenancy | Grafana Loki documentation
- Grafana Configuration
- HTTP API | Grafana Loki documentation
- Forwarding Logs to LokiStack - Loki Operator
- API - Loki Operator
- Configure generic OAuth authentication | Grafana documentation