Adding ray operator addon support for GKE cluster creation #3584

Merged: 5 commits, Jan 27, 2025
Changes from 4 commits
2 changes: 2 additions & 0 deletions examples/ml-gke.yaml
@@ -25,6 +25,7 @@ vars:
  # The following line must be updated for this example to work.
  authorized_cidr: <your-ip-address>/32
  gcp_public_cidrs_access_enabled: false
  enable_ray_operator: false

deployment_groups:
- group: primary
@@ -57,6 +58,7 @@ deployment_groups:
    source: modules/scheduler/gke-cluster
    use: [network1, gke_service_account]
    settings:
      enable_ray_operator: $(vars.enable_ray_operator)
      enable_private_endpoint: false # Allows for access from authorized public IPs
      gcp_public_cidrs_access_enabled: $(vars.gcp_public_cidrs_access_enabled)
      master_authorized_networks:
1 change: 1 addition & 0 deletions modules/scheduler/gke-cluster/README.md
@@ -160,6 +160,7 @@ limitations under the License.
| <a name="input_enable_private_endpoint"></a> [enable\_private\_endpoint](#input\_enable\_private\_endpoint) | (Beta) Whether the master's internal IP address is used as the cluster endpoint. | `bool` | `true` | no |
| <a name="input_enable_private_ipv6_google_access"></a> [enable\_private\_ipv6\_google\_access](#input\_enable\_private\_ipv6\_google\_access) | The private IPv6 google access type for the VMs in this subnet. | `bool` | `true` | no |
| <a name="input_enable_private_nodes"></a> [enable\_private\_nodes](#input\_enable\_private\_nodes) | (Beta) Whether nodes have internal IP addresses only. | `bool` | `true` | no |
| <a name="input_enable_ray_operator"></a> [enable\_ray\_operator](#input\_enable\_ray\_operator) | The status of the Ray operator addon. This feature enables Kubernetes APIs for managing and scaling Ray clusters and jobs. You control and are responsible for managing ray.io custom resources in your cluster. This feature is not compatible with GKE clusters that already have another Ray operator installed. Supports clusters on Kubernetes version 1.29.8-gke.1054000 or later. | `bool` | `false` | no |
| <a name="input_gcp_public_cidrs_access_enabled"></a> [gcp\_public\_cidrs\_access\_enabled](#input\_gcp\_public\_cidrs\_access\_enabled) | Whether the cluster master is accessible via all the Google Compute Engine Public IPs. To view this list of IP addresses look here https://cloud.google.com/compute/docs/faq#find_ip_range | `bool` | `false` | no |
| <a name="input_labels"></a> [labels](#input\_labels) | GCE resource labels to be applied to resources. Key-value pairs. | `map(string)` | n/a | yes |
| <a name="input_maintenance_exclusions"></a> [maintenance\_exclusions](#input\_maintenance\_exclusions) | List of maintenance exclusions. A cluster can have up to three. | <pre>list(object({<br/> name = string<br/> start_time = string<br/> end_time = string<br/> exclusion_scope = string<br/> }))</pre> | `[]` | no |
3 changes: 3 additions & 0 deletions modules/scheduler/gke-cluster/main.tf
@@ -205,6 +205,9 @@ resource "google_container_cluster" "gke_cluster" {
    parallelstore_csi_driver_config {
      enabled = var.enable_parallelstore_csi
    }
    ray_operator_config {
      enabled = var.enable_ray_operator
    }
  }

  timeouts {
6 changes: 6 additions & 0 deletions modules/scheduler/gke-cluster/variables.tf
@@ -145,6 +145,12 @@ variable "enable_parallelstore_csi" {
  default     = false
}

variable "enable_ray_operator" {
  description = "The status of the Ray operator addon. This feature enables Kubernetes APIs for managing and scaling Ray clusters and jobs. You control and are responsible for managing ray.io custom resources in your cluster. This feature is not compatible with GKE clusters that already have another Ray operator installed. Supports clusters on Kubernetes version 1.29.8-gke.1054000 or later."
  type        = bool
  default     = false
}

variable "enable_dcgm_monitoring" {
  description = "Enable GKE to collect DCGM metrics"
  type        = bool
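
The `enable_ray_operator` description above names a minimum cluster version of 1.29.8-gke.1054000. As a minimal sketch of how a caller might check a cluster version against that minimum before enabling the addon, the Python below compares version strings numerically. The helper names (`parse_gke_version`, `supports_ray_operator`) are hypothetical and not part of this PR, and the comparison assumes GKE versions order field by field as `MAJOR.MINOR.PATCH-gke.BUILD`.

```python
import re

# Minimum version quoted in the enable_ray_operator description.
MIN_RAY_OPERATOR_VERSION = "1.29.8-gke.1054000"

# Assumed version layout: MAJOR.MINOR.PATCH-gke.BUILD
_GKE_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)-gke\.(\d+)$")


def parse_gke_version(version: str) -> tuple:
    """Split 'MAJOR.MINOR.PATCH-gke.BUILD' into a tuple of four ints."""
    match = _GKE_VERSION.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized GKE version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def supports_ray_operator(cluster_version: str) -> bool:
    """Hypothetical helper: True if the version is at or above the documented minimum."""
    return parse_gke_version(cluster_version) >= parse_gke_version(MIN_RAY_OPERATOR_VERSION)
```

Tuples compare element by element, so a newer minor release with a smaller build number still passes, while an older build of the same patch release does not.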
@@ -0,0 +1,32 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

- name: Assert variables are defined
  ansible.builtin.assert:
    that:
    - region is defined
    - custom_vars.project is defined
    - cli_deployment_vars.enable_ray_operator is defined

- name: Get cluster credentials for kubectl
  delegate_to: localhost
  ansible.builtin.command: gcloud container clusters get-credentials {{ deployment_name }} --region {{ region }} --project {{ custom_vars.project }}

- name: Check that Ray CRDs exist in the cluster
  delegate_to: localhost
  ansible.builtin.shell: |
    kubectl get rayjobs.ray.io
  args:
    executable: /bin/bash
  changed_when: False
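
The validation playbook above checks a single resource type, `rayjobs.ray.io`. As a rough sketch of a broader check, the Python below asks `kubectl get crd <name>` about several Ray CRDs and reports any that are missing. The CRD list is an assumption taken from the upstream KubeRay project, not from this PR, and the function names are hypothetical. The command runner is injectable so the logic can be exercised without a cluster.

```python
import subprocess
from typing import Callable, List, Sequence

# Assumed CRD names (from upstream KubeRay); the playbook itself only checks rayjobs.ray.io.
RAY_CRDS = ("rayclusters.ray.io", "rayjobs.ray.io", "rayservices.ray.io")


def kubectl_exit_code(args: Sequence[str]) -> int:
    """Run kubectl with the given arguments and return its exit code."""
    completed = subprocess.run(["kubectl", *args], capture_output=True, check=False)
    return completed.returncode


def missing_ray_crds(run: Callable[[Sequence[str]], int] = kubectl_exit_code) -> List[str]:
    """Return the Ray CRDs for which `kubectl get crd <name>` exits non-zero."""
    return [crd for crd in RAY_CRDS if run(["get", "crd", crd]) != 0]
```

Calling `missing_ray_crds()` with no arguments uses the real `kubectl` and the credentials fetched by the earlier `gcloud container clusters get-credentials` step. An empty list means every assumed CRD was found.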
30 changes: 30 additions & 0 deletions tools/cloud-build/daily-tests/tests/ml-gke-ray.yml
@@ -0,0 +1,30 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
test_name: ml-gke-ray
deployment_name: ml-gke-ray-{{ build }}
region: asia-southeast1
zone: asia-southeast1-b # for remote node
workspace: /workspace
blueprint_yaml: "{{ workspace }}/examples/ml-gke.yaml"
network: "{{ deployment_name }}-net"
remote_node: "{{ deployment_name }}-0"
cli_deployment_vars:
  region: "{{ region }}"
  gcp_public_cidrs_access_enabled: true
  enable_ray_operator: true
custom_vars:
  project: "{{ project }}"
post_deploy_tests:
- test-validation/test-gke-ray.yml
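
The test configuration above sets `enable_ray_operator: true` under `cli_deployment_vars`, while `examples/ml-gke.yaml` defaults it to `false` and passes it to the module as `$(vars.enable_ray_operator)`. The Python below is an illustrative model of that flow, not the toolkit's implementation. It assumes command-line deployment variables take precedence over blueprint `vars`, and that a setting consisting solely of `$(vars.NAME)` is replaced by the variable's value. Both function names are hypothetical.

```python
import re
from typing import Any, Dict

# Matches a setting whose entire value is a single $(vars.NAME) reference.
_VAR_REF = re.compile(r"^\$\(vars\.([A-Za-z0-9_]+)\)$")


def effective_vars(blueprint_vars: Dict[str, Any], cli_vars: Dict[str, Any]) -> Dict[str, Any]:
    """Assumed precedence: command-line values override blueprint defaults."""
    return {**blueprint_vars, **cli_vars}


def resolve_settings(settings: Dict[str, Any], variables: Dict[str, Any]) -> Dict[str, Any]:
    """Replace whole-value '$(vars.NAME)' references with the variable's value."""
    resolved = {}
    for key, value in settings.items():
        match = _VAR_REF.match(value) if isinstance(value, str) else None
        resolved[key] = variables[match.group(1)] if match else value
    return resolved


# Values copied from examples/ml-gke.yaml and the ml-gke-ray test configuration.
blueprint_vars = {"gcp_public_cidrs_access_enabled": False, "enable_ray_operator": False}
cli_deployment_vars = {"gcp_public_cidrs_access_enabled": True, "enable_ray_operator": True}
settings = {
    "enable_ray_operator": "$(vars.enable_ray_operator)",
    "enable_private_endpoint": False,
    "gcp_public_cidrs_access_enabled": "$(vars.gcp_public_cidrs_access_enabled)",
}

resolved = resolve_settings(settings, effective_vars(blueprint_vars, cli_deployment_vars))
print(resolved)
```

Under these assumptions the module receives `enable_ray_operator = True` in the daily test even though the example blueprint leaves the addon disabled by default.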