Commit

Merge pull request #1223 from run-ai/maintenance-overview
maintenance
yarongol authored Nov 11, 2024
2 parents dc5b5c4 + 676cdca commit 470f5d0
Showing 2 changed files with 32 additions and 0 deletions.
31 changes: 31 additions & 0 deletions docs/admin/maintenance/overview.md
@@ -0,0 +1,31 @@
# Monitoring and maintenance Overview

Deploying Run:ai in mission-critical environments requires proper monitoring and maintenance of resources to ensure workloads run and are deployed as expected.

Details on how to monitor different parts of the physical resources in your Kubernetes system, including [clusters](../config/clusters.md) and [nodes](../../platform-admin/aiinitiatives/resources/nodes.md), can be found in the monitoring and maintenance section. Adjacent configuration and troubleshooting sections also cover [high availability](../config/ha.md), [restoring](../config/dr.md) and [securing](../config/secure-cluster.md) clusters, [collecting logs](../troubleshooting/logs-collection.md), and [reviewing audit logs](./audit-log.md) to meet compliance requirements.

In addition to monitoring Run:ai resources, it is also highly recommended to monitor the Kubernetes environment that Run:ai runs on, which manages the containerized applications. In particular, focus on three main layers:

## Run:ai Control Plane and cluster services

This is the highest layer and includes the Run:ai pods, which run in containers managed by Kubernetes.
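
As a rough illustration of what monitoring this layer can look like, the sketch below flags pods that are not fully healthy, given the parsed JSON output of `kubectl get pods -n <namespace> -o json`. The helper name `unhealthy_pods` is hypothetical, and the namespace that holds the Run:ai pods depends on your installation, so it is left as a placeholder here.

```python
def unhealthy_pods(pod_list: dict) -> list[str]:
    """Return "name (phase)" for every pod that needs attention.

    `pod_list` is assumed to be the parsed JSON produced by
    `kubectl get pods -n <namespace> -o json` (hypothetical input source).
    """
    problems = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase == "Succeeded":
            # Completed pods (for example, finished jobs) are not a problem.
            continue
        containers = status.get("containerStatuses", [])
        # A pod counts as healthy only if it reports containers and all are ready.
        all_ready = bool(containers) and all(
            c.get("ready", False) for c in containers
        )
        if phase != "Running" or not all_ready:
            problems.append(f'{pod["metadata"]["name"]} ({phase})')
    return problems
```

Feeding the function the output of `kubectl get pods` for the relevant namespace on a schedule gives a simple list to alert on. It complements, rather than replaces, a full monitoring stack.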

## Kubernetes cluster

This layer includes the main Kubernetes system that runs and manages Run:ai components. Important elements to monitor include:

* The health of the cluster and nodes (machines in the cluster).
* The status of key Kubernetes services, such as the API server. For detailed information on managing clusters, see the [official Kubernetes documentation](https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-usage-monitoring/){target=_blank}.
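
The node health check in the list above could be sketched as follows. This is a minimal example, not part of Run:ai: it assumes `kubectl` is installed and configured for the cluster, and the helper names `fetch_node_list` and `not_ready_nodes` are hypothetical. It relies only on the standard `Ready` condition that Kubernetes reports for each node.

```python
import json
import subprocess


def fetch_node_list() -> dict:
    """Return the parsed output of `kubectl get nodes -o json`.

    Assumes kubectl is on PATH and points at the cluster to inspect.
    """
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def not_ready_nodes(node_list: dict) -> list[str]:
    """Return names of nodes whose Ready condition is missing or not "True"."""
    unhealthy = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            unhealthy.append(node["metadata"]["name"])
    return unhealthy
```

Calling `not_ready_nodes(fetch_node_list())` returns an empty list when every node reports `Ready`. Any names it does return are a starting point for `kubectl describe node <name>`.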

## Host infrastructure

This is the base layer, representing the actual machines (virtual or physical) that make up the cluster. IT teams need to handle:

* Managing CPU, memory, and storage
* Keeping the operating system updated
* Setting up the network and balancing the load

Run:ai does not require any special configurations at this level.

The articles below explain how to monitor these layers, maintain system security and compliance, and ensure the reliable operation of Run:ai in critical environments.

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -263,6 +263,7 @@ nav:
- 'Mark Assets for Run:ai' : 'admin/config/create-k8s-assets-in-advance.md'

- 'Maintenance' :
- 'Monitoring and maintenance Overview' : 'admin/maintenance/overview.md'
- 'Node Maintenance' : 'admin/maintenance/node-downtime.md'
- 'System Monitoring' : 'admin/maintenance/alert-monitoring.md'
- 'Audit Log' : 'admin/maintenance/audit-log.md'