Streamline configuration & install Prometheus and Grafana as subcharts #40

Merged: 65 commits, Feb 13, 2025
Commits:
89dbca5 install Grafana via subchart (kondratyevd, Feb 7, 2025)
5ed3401 Update JSON schema (actions-user, Feb 7, 2025)
640d47c Merge branch 'subcharts' of github.com:fastmachinelearning/SuperSONIC… (kondratyevd, Feb 7, 2025)
81b4b99 Update helm docs (actions-user, Feb 7, 2025)
9ca10e2 Merge branch 'subcharts' of github.com:fastmachinelearning/SuperSONIC… (kondratyevd, Feb 7, 2025)
0df5761 add dependency in ci (kondratyevd, Feb 7, 2025)
19835ca add dependency in ci (kondratyevd, Feb 7, 2025)
3839dcf update helmignore (kondratyevd, Feb 7, 2025)
7b06cc3 update helpers (kondratyevd, Feb 7, 2025)
c33d4d2 disable rbac validation and simplify prometheus helpers for now (kondratyevd, Feb 7, 2025)
248fda3 fix NOTES (kondratyevd, Feb 7, 2025)
ff9201a fix CI (kondratyevd, Feb 7, 2025)
1b33d5a temporary fix for CI (kondratyevd, Feb 7, 2025)
e984033 install Prometheus as subchart (kondratyevd, Feb 11, 2025)
79bb3b1 Update JSON schema (actions-user, Feb 12, 2025)
aa570b2 Update helm docs (actions-user, Feb 12, 2025)
b31957f Merge branch 'subcharts' of github.com:fastmachinelearning/SuperSONIC… (kondratyevd, Feb 12, 2025)
2b73a37 remove grafana-legacy configs (kondratyevd, Feb 12, 2025)
0a77eb3 Update JSON schema (actions-user, Feb 12, 2025)
45c5187 Update helm docs (actions-user, Feb 12, 2025)
23cf464 Merge branch 'subcharts' of github.com:fastmachinelearning/SuperSONIC… (kondratyevd, Feb 12, 2025)
0b6e3cd streamline prometheus configuration (kondratyevd, Feb 12, 2025)
8491a36 Update JSON schema (actions-user, Feb 12, 2025)
51816a1 Update helm docs (actions-user, Feb 12, 2025)
5526d1f Merge branch 'subcharts' of github.com:fastmachinelearning/SuperSONIC… (kondratyevd, Feb 12, 2025)
cf05661 update grafana helpers (kondratyevd, Feb 12, 2025)
9f9c27d add parameter validation for prometheus and grafana (kondratyevd, Feb 12, 2025)
5669da9 disable prometheus ingress by default (kondratyevd, Feb 12, 2025)
36da763 Update JSON schema (actions-user, Feb 12, 2025)
1d4654f Update helm docs (actions-user, Feb 12, 2025)
2036ab9 Merge branch 'subcharts' of github.com:fastmachinelearning/SuperSONIC… (kondratyevd, Feb 12, 2025)
c416ce3 add a ci script for local testing, fix GH CI (kondratyevd, Feb 12, 2025)
8b38dcd add Chart.lock (kondratyevd, Feb 12, 2025)
91eb2f3 update Chart.lock (kondratyevd, Feb 12, 2025)
ae4e4ba update CI (kondratyevd, Feb 12, 2025)
dcda352 update CI (kondratyevd, Feb 12, 2025)
2b825b0 udpate CI (kondratyevd, Feb 12, 2025)
e857b0b update .gitattributes (kondratyevd, Feb 12, 2025)
33b0214 update values (kondratyevd, Feb 12, 2025)
12ec6e6 fix CI (kondratyevd, Feb 12, 2025)
e12c4da fix CI (kondratyevd, Feb 12, 2025)
f0091bd move ingress configuration under envoy: (kondratyevd, Feb 12, 2025)
6a5f4fb Update JSON schema (actions-user, Feb 12, 2025)
5e6efb3 Update helm docs (actions-user, Feb 12, 2025)
76f0c1a Merge branch 'subcharts' of github.com:fastmachinelearning/SuperSONIC… (kondratyevd, Feb 12, 2025)
eed7123 fix grpcEndpoint helper (kondratyevd, Feb 12, 2025)
548e16d clean up values files (kondratyevd, Feb 12, 2025)
512a73d improve grafana and prometheus logic (kondratyevd, Feb 13, 2025)
0645653 update dashboard (kondratyevd, Feb 13, 2025)
1e12681 improve monitoring (kondratyevd, Feb 13, 2025)
9cfa749 add validation for grafana datasources (kondratyevd, Feb 13, 2025)
f361bc8 add grafana datasources to values files (kondratyevd, Feb 13, 2025)
1e9207e add missing labels (kondratyevd, Feb 13, 2025)
3b9a4df prevent from deploying ingresses with duplicate names (kondratyevd, Feb 13, 2025)
0f32c8e further streamline *.tpl helpers (kondratyevd, Feb 13, 2025)
176049c update values file (kondratyevd, Feb 13, 2025)
df79c65 update documentation (kondratyevd, Feb 13, 2025)
e4dc52a update README (kondratyevd, Feb 13, 2025)
3145143 Update helm docs (actions-user, Feb 13, 2025)
3679233 Merge branch 'subcharts' of github.com:fastmachinelearning/SuperSONIC… (kondratyevd, Feb 13, 2025)
ceebe41 update diagram (kondratyevd, Feb 13, 2025)
ab93e8c Merge branch 'main' of github.com:fastmachinelearning/SuperSONIC (kondratyevd, Feb 13, 2025)
0d1da03 Resolve conflicts and merge branch 'subcharts' (kondratyevd, Feb 13, 2025)
eead4f6 clean up (kondratyevd, Feb 13, 2025)
1566083 removing the whitespace removal for the auth cluster section of envoy… (kondratyevd, Feb 13, 2025)
3 changes: 2 additions & 1 deletion .gitattributes
@@ -1,3 +1,4 @@
*.json linguist-detectable
*.yml linguist-detectable
- *.yaml linguist-detectable
+ *.yaml linguist-detectable
+ *.tpl linguist-language=Go
9 changes: 6 additions & 3 deletions .github/workflows/ci-github-cms.yaml
@@ -51,6 +51,9 @@ jobs:

- name: Deploy Helm chart
run: |
+ helm repo add grafana https://grafana.github.io/helm-charts
+ helm repo update
+ helm dependency build ./helm/supersonic
helm upgrade --install supersonic ./helm/supersonic \
--values values/values-cms-ci.yaml -n cms

@@ -64,12 +67,12 @@ jobs:

- name: Prometheus ready
run: |
- kubectl wait --for condition=Ready pod -l app.kubernetes.io/component=prometheus --timeout 120s -n cms
- kubectl get svc,pod -l app.kubernetes.io/component=prometheus -n cms
+ kubectl wait --for condition=Ready pod -l app.kubernetes.io/name=prometheus --timeout 120s -n cms
+ kubectl get svc,pod -l app.kubernetes.io/name=prometheus -n cms

- name: Grafana ready
run: |
- kubectl wait --for condition=Ready pod -l app.kubernetes.io/component=grafana --timeout 120s -n cms
+ kubectl wait --for condition=Ready pod -l app.kubernetes.io/name=grafana --timeout 120s -n cms

- name: Triton server ready
run: |
93 changes: 93 additions & 0 deletions .github/workflows/ci-local.sh
@@ -0,0 +1,93 @@
#!/bin/bash

echo "Starting deployment process..."

# 1. Create a Kubernetes cluster with Kind
echo "Creating Kind cluster..."
kind create cluster --name gh-k8s-cluster

# 2. (Assuming Helm is installed and at the proper version)

# 3. Create CMS namespace
echo "Creating CMS namespace..."
kubectl create namespace cms

# 4. Install Prometheus Operator CRDs
echo "Installing Prometheus Operator CRDs..."
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm install prometheus-operator prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheusOperator.createCustomResource=false \
  --set defaultRules.create=false \
  --set alertmanager.enabled=false \
  --set prometheus.enabled=false \
  --set grafana.enabled=false

# 5. Install KEDA Autoscaler
echo "Installing KEDA Autoscaler..."
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
kubectl create namespace keda
helm install keda kedacore/keda --namespace keda

# 6. Mount CVMFS
echo "Mounting CVMFS..."
kubectl create namespace cvmfs-csi
helm install -n cvmfs-csi cvmfs-csi oci://registry.cern.ch/kubernetes/charts/cvmfs-csi \
  --values ci/values-cvmfs-csi.yaml
kubectl apply -f ci/cvmfs-storageclass.yaml -n cvmfs-csi

# 7. Deploy the Helm chart for supersonic
echo "Deploying Helm chart for supersonic..."
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm dependency build ./helm/supersonic
helm upgrade --install supersonic ./helm/supersonic --values values/values-cms-ci.yaml -n cms

# 8. Wait for components to become ready

echo "Waiting for CVMFS pods to be ready..."
kubectl wait --for=condition=Ready pod --all -n cvmfs-csi --timeout 120s

echo "Waiting for Envoy proxy pods to be ready..."
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/component=envoy --timeout 120s -n cms

echo "Waiting for Prometheus pods to be ready..."
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=prometheus --timeout 120s -n cms
kubectl get svc,pod -l app.kubernetes.io/name=prometheus -n cms

echo "Waiting for Grafana pods to be ready..."
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=grafana --timeout 120s -n cms

echo "Waiting for Triton server pods to be ready..."
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/component=triton --timeout 300s -n cms

echo "Waiting for KEDA Autoscaler to be ready..."
kubectl wait --for=condition=AbleToScale hpa -l app.kubernetes.io/component=keda --timeout 120s -n cms
kubectl wait --for=condition=Ready so -l app.kubernetes.io/component=keda --timeout 120s -n cms

# 9. Validate the Deployment
echo "Validating Deployment in 'cms' namespace..."
kubectl get all -n cms

# 10. Run Perf Analyzer Job
echo "Running Perf Analyzer Job..."
kubectl apply -f ci/perf-analyzer-job.yaml
kubectl wait --for=condition=complete job/perf-analyzer-job -n cms --timeout=180s || {
  echo "Perf-analyzer job did not complete in time or failed."
  exit 1
}

# Retrieve and print the logs from the Perf Analyzer pod
POD_NAME=$(kubectl get pods -n cms -l job-name=perf-analyzer-job -o jsonpath="{.items[0].metadata.name}")
echo "========== Perf Analyzer Logs =========="
kubectl logs -n cms "$POD_NAME"
echo "========================================"

# 11. Cleanup the Kind cluster
echo "Cleaning up: Deleting Kind cluster..."
kind delete cluster --name gh-k8s-cluster

echo "Deployment process completed successfully!"
4 changes: 4 additions & 0 deletions .github/workflows/helm-lint.yaml
@@ -55,6 +55,10 @@ jobs:

- name: Lint values.yaml files in values/ directory
run: |
+ helm repo add grafana https://grafana.github.io/helm-charts
+ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+ helm repo update
+ helm dependency build ./helm/supersonic
CHART_PATH="helm/supersonic/"
VALUES_DIR="values/"

4 changes: 3 additions & 1 deletion .gitignore
@@ -1,2 +1,4 @@
# Sphinx Documentation
- docs/_build
+ docs/_build
+
+ *.tgz
2 changes: 2 additions & 0 deletions README.md
@@ -26,6 +26,8 @@ The main components of SuperSONIC are:

```
helm repo add fastml https://fastmachinelearning.org/SuperSONIC
+ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+ helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install <release-name> fastml/supersonic --values <your-values.yaml> -n <namespace>
```
40 changes: 23 additions & 17 deletions docs/.values-table.md
@@ -3,6 +3,8 @@
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| nameOverride | string | `""` | Unique identifier of SuperSONIC instance (equal to release name by default) |
+ | serverLoadMetric | string | `""` | A metric used by both KEDA autoscaler and Envoy's prometheus-based rate limiter. # Default metric (inference queue latency) is defined in templates/_helpers.tpl |
+ | serverLoadThreshold | int | `100` | Threshold for the metric |
| triton.replicas | int | `1` | Number of Triton server instances (if autoscaling is disabled) |
| triton.image | string | `"nvcr.io/nvidia/tritonserver:24.12-py3-min"` | Docker image for the Triton server |
| triton.command | list | `["/bin/sh","-c"]` | Command and arguments to run in Triton container |
@@ -22,6 +24,7 @@
| envoy.resources | object | `{"limits":{"cpu":2,"memory":"4G"},"requests":{"cpu":1,"memory":"2G"}}` | Resource requests and limits for Envoy Proxy. Note: an Envoy Proxy with too many connections might run out of CPU |
| envoy.service.type | string | `"ClusterIP"` | This is the client-facing endpoint. In order to be able to connect to it, either enable ingress, or use type: LoadBalancer. |
| envoy.service.ports | list | `[{"name":"grpc","port":8001,"targetPort":8001},{"name":"admin","port":9901,"targetPort":9901}]` | Envoy Service ports |
+ | envoy.ingress | object | `{"annotations":{},"enabled":false,"hostName":"","ingressClassName":""}` | Ingress configuration for Envoy |
| envoy.grpc_route_timeout | string | `"0s"` | Timeout for gRPC route in Envoy; disabled by default (0s), preventing Envoy from closing connections too early. |
| envoy.rate_limiter.listener_level | object | `{"enabled":false,"fill_interval":"12s","max_tokens":5,"tokens_per_fill":1}` | This rate limiter explicitly controls the number of client connections to the Envoy Proxy. |
| envoy.rate_limiter.listener_level.enabled | bool | `false` | Enable rate limiter |
@@ -47,22 +50,25 @@
| autoscaler.scaleDown.window | int | `600` | |
| autoscaler.scaleDown.period | int | `120` | |
| autoscaler.scaleDown.stepsize | int | `1` | |
- | prometheus | object | `{"external":true,"ingress":{"annotations":{},"enabled":false,"hostName":"","ingressClassName":""},"port":443,"scheme":"https","serverLoadMetric":"","serverLoadThreshold":100,"url":""}` | Connection to a Prometheus server is required for KEDA autoscaler and Envoy's prometheus-based rate limiter |
- | prometheus.external | bool | `true` | Whether to use external Prometheus instance (true) or deploy internal one (false) |
- | prometheus.url | string | `""` | External Prometheus server url and port number (find in documentation of a given cluster or ask admins) Only used when external=true |
- | prometheus.scheme | string | `"https"` | Specify whether external Prometheus endpoint is exposed as http or https Only used when external=true |
- | prometheus.serverLoadMetric | string | `""` | A metric used by both KEDA autoscaler and Envoy's prometheus-based rate limiter. # Default metric (inference queue latency) is defined in templates/_helpers.tpl |
- | prometheus.serverLoadThreshold | int | `100` | Threshold for the metric |
- | prometheus.ingress | object | `{"annotations":{},"enabled":false,"hostName":"","ingressClassName":""}` | Ingress configuration for internal Prometheus web UI (only used when external=false) |
- | ingress.enabled | bool | `false` | |
- | ingress.hostName | string | `""` | |
- | ingress.ingressClassName | string | `""` | |
- | ingress.annotations | object | `{}` | |
| nodeSelector | object | `{}` | Node selector for all pods (Triton and Envoy) |
| tolerations | list | `[]` | Tolerations for all pods (Triton and Envoy) |
- | grafana.enabled | bool | `false` | Enable or disable Grafana deployment |
- | grafana.ingress | object | `{"annotations":{},"enabled":false,"hostName":"","ingressClassName":"haproxy"}` | Ingress configuration for Grafana |
- | grafana.ingress.enabled | bool | `false` | Enable or disable ingress for Grafana |
- | grafana.ingress.hostName | string | `""` | Hostname for Grafana ingress |
- | grafana.ingress.ingressClassName | string | `"haproxy"` | Ingress class name (e.g. nginx, haproxy) |
- | grafana.ingress.annotations | object | `{}` | Additional annotations for Grafana ingress |
+ | prometheus | object | `{"alertmanager":{"enabled":false},"configmapReload":{"prometheus":{"enabled":false}},"enabled":false,"external":{"enabled":false,"port":443,"scheme":"https","url":""},"kube-state-metrics":{"enabled":false},"prometheus-node-exporter":{"enabled":false},"prometheus-pushgateway":{"enabled":false},"pushgateway":{"enabled":false},"rbac":{"create":false},"server":{"configMapOverrideName":"prometheus-config","global":{"evaluation_interval":"5s","scrape_interval":"5s"},"ingress":{"annotations":{},"enabled":false,"hosts":[],"ingressClassName":"","tls":[{"hosts":[]}]},"persistentVolume":{"enabled":false},"releaseNamespace":true,"resources":{"limits":{"cpu":1,"memory":"1Gi"},"requests":{"cpu":"500m","memory":"512Mi"}},"retention":"15d","service":{"enabled":true,"servicePort":9090},"useExistingClusterRoleName":"supersonic-prometheus-role"},"serviceAccounts":{"server":{"create":false,"name":"supersonic-prometheus-sa"}}}` | Connection to a Prometheus server is required for KEDA autoscaler and Envoy's prometheus-based rate limiter |
+ | prometheus.external.enabled | bool | `false` | Enable external Prometheus instance |
+ | prometheus.external.url | string | `""` | External Prometheus server url |
+ | prometheus.external.port | int | `443` | External Prometheus server port number |
+ | prometheus.external.scheme | string | `"https"` | Specify whether external Prometheus endpoint is exposed as http or https |
+ | prometheus.enabled | bool | `false` | Enable or disable Prometheus subchart deployment |
+ | prometheus.server | object | `{"configMapOverrideName":"prometheus-config","global":{"evaluation_interval":"5s","scrape_interval":"5s"},"ingress":{"annotations":{},"enabled":false,"hosts":[],"ingressClassName":"","tls":[{"hosts":[]}]},"persistentVolume":{"enabled":false},"releaseNamespace":true,"resources":{"limits":{"cpu":1,"memory":"1Gi"},"requests":{"cpu":"500m","memory":"512Mi"}},"retention":"15d","service":{"enabled":true,"servicePort":9090},"useExistingClusterRoleName":"supersonic-prometheus-role"}` | Prometheus Helm chart configuration (https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus) |
+ | grafana.enabled | bool | `false` | |
+ | grafana.adminUser | string | `"admin"` | |
+ | grafana.adminPassword | string | `"admin"` | |
+ | grafana.persistence.enabled | bool | `false` | |
+ | grafana.rbac.create | bool | `false` | |
+ | grafana.serviceAccount.create | bool | `false` | |
+ | grafana.datasources | object | `{"datasources.yaml":{"apiVersion":1,"datasources":[{"access":"proxy","isDefault":true,"jsonData":{"timeInterval":"5s","tlsSkipVerify":true},"name":"prometheus","type":"prometheus","url":"http://supersonic-prometheus-server:9090"}]}}` | Grafana datasources configuration |
+ | grafana.dashboardProviders | object | `{"dashboardproviders.yaml":{"apiVersion":1,"providers":[{"disableDeletion":false,"editable":true,"folder":"","name":"default","options":{"path":"/var/lib/grafana/dashboards/default"},"orgId":1,"type":"file"}]}}` | Grafana dashboard providers configuration |
+ | grafana.dashboardsConfigMaps | object | `{"default":"supersonic-grafana-default-dashboard"}` | Grafana dashboard ConfigMaps |
+ | grafana."grafana.ini" | object | `{"auth":{"disable_login_form":true},"auth.anonymous":{"enabled":true,"org_role":"Admin"},"dashboards":{"default_home_dashboard_path":"/var/lib/grafana/dashboards/default/default.json"}}` | Grafana.ini configuration |
+ | grafana.resources | object | `{"limits":{"cpu":1,"memory":"1Gi"},"requests":{"cpu":"100m","memory":"128Mi"}}` | Resource limits and requests for Grafana |
+ | grafana.service | object | `{"port":80,"targetPort":3000,"type":"ClusterIP"}` | Service configuration |
+ | grafana.ingress | object | `{"annotations":{},"enabled":false,"hosts":[],"ingressClassName":"","path":"/","pathType":"ImplementationSpecific","tls":[]}` | Ingress configuration |
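
For orientation, the keys documented in the table above could be combined into a values file along the following lines. This is a minimal sketch and not part of the PR: the file name `my-values.yaml` and the host `prometheus.example.org` are hypothetical, and it assumes the Prometheus subchart and the external Prometheus connection are meant as alternatives rather than being enabled together.

```yaml
# my-values.yaml (hypothetical file name); key names are taken from the values table above.

# An empty string falls back to the default metric defined in templates/_helpers.tpl.
serverLoadMetric: ""
serverLoadThreshold: 100

prometheus:
  # Deploy Prometheus as a subchart of SuperSONIC.
  enabled: true
  external:
    # Assumption: leave this off while the subchart is enabled.
    enabled: false
    # To point at an existing server instead, set prometheus.enabled to false
    # and fill in the connection details, for example:
    # enabled: true
    # url: "prometheus.example.org"   # hypothetical host
    # port: 443
    # scheme: "https"

grafana:
  # Deploy Grafana as a subchart. The default datasource listed in the table
  # points at http://supersonic-prometheus-server:9090.
  enabled: true
```

A file like this would be passed through `--values`, in the same way as `<your-values.yaml>` in the README install command.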