Adding ray operator addon support for GKE cluster creation #3583

Draft: wants to merge 1 commit into base: main
74 changes: 74 additions & 0 deletions examples/ml-gke-ray.yaml
@@ -0,0 +1,74 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---

blueprint_name: ml-gke-ray

vars:
  project_id:  ## Set GCP Project ID Here ##
  deployment_name: ml-gke-ray-01
  region: asia-southeast1
  zones:
  - asia-southeast1-b  # g2 machine has better availability in this zone
  # CIDR block containing the IP of the machine calling terraform.
  # The following line must be updated for this example to work.
  authorized_cidr: <your-ip-address>/32
  gcp_public_cidrs_access_enabled: false

deployment_groups:
- group: primary
  modules:
  - id: network1
    source: modules/network/vpc
    settings:
      subnetwork_name: $(vars.deployment_name)-subnet
      secondary_ranges_list:
      - subnetwork_name: $(vars.deployment_name)-subnet
        ranges:
        - range_name: pods
          ip_cidr_range: 10.4.0.0/14
        - range_name: services
          ip_cidr_range: 10.0.32.0/20

  - id: gke_service_account
    source: community/modules/project/service-account
    settings:
      name: gke-sa
      project_roles:
      - logging.logWriter
      - monitoring.metricWriter
      - monitoring.viewer
      - stackdriver.resourceMetadata.writer
      - storage.objectViewer
      - artifactregistry.reader

  - id: gke_cluster
    source: modules/scheduler/gke-cluster
    use: [network1, gke_service_account]
    settings:
      enable_ray_operator: true
      enable_private_endpoint: false  # Allows for access from authorized public IPs
      gcp_public_cidrs_access_enabled: $(vars.gcp_public_cidrs_access_enabled)
      master_authorized_networks:
      - display_name: deployment-machine
        cidr_block: $(vars.authorized_cidr)
      configure_workload_identity_sa: true
    outputs: [instructions]

  - id: g2_pool
    source: modules/compute/gke-node-pool
    use: [gke_cluster, gke_service_account]
    settings:
      disk_type: pd-balanced
      machine_type: g2-standard-4
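
Note on usage: with `enable_ray_operator: true`, the cluster exposes the ray.io APIs, but the blueprint itself does not create any Ray cluster. Ray workloads are declared separately as ray.io custom resources. A minimal sketch of such a manifest is given here, assuming the `ray.io/v1` `RayCluster` kind; the resource name, Ray version, image tag, group name, and CPU/memory sizes are illustrative assumptions and are not part of this PR, and nothing here is tuned for the `g2_pool` nodes.

```yaml
# Hypothetical RayCluster manifest, applied with kubectl after the GKE
# cluster from this blueprint is up. All names and sizes are placeholders.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample        # assumed name
spec:
  rayVersion: "2.9.0"            # assumed Ray version; keep in sync with the image tag
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
  workerGroupSpecs:
  - groupName: workers           # assumed group name
    replicas: 1
    minReplicas: 1
    maxReplicas: 2
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0
          resources:
            limits:
              cpu: "1"
              memory: 2Gi
```

Because the addon manages the operator itself, a manifest like this should be the only Ray-related object the user applies; installing a second Ray operator alongside the addon is what the variable description warns against.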
1 change: 1 addition & 0 deletions modules/scheduler/gke-cluster/README.md
@@ -151,6 +151,7 @@ limitations under the License.
| <a name="input_enable_dataplane_v2"></a> [enable\_dataplane\_v2](#input\_enable\_dataplane\_v2) | Enables [Dataplane v2](https://cloud.google.com/kubernetes-engine/docs/concepts/dataplane-v2). This setting is immutable on clusters. If null, will default to false unless using multi-networking, in which case it will default to true | `bool` | `null` | no |
| <a name="input_enable_dcgm_monitoring"></a> [enable\_dcgm\_monitoring](#input\_enable\_dcgm\_monitoring) | Enable GKE to collect DCGM metrics | `bool` | `false` | no |
| <a name="input_enable_filestore_csi"></a> [enable\_filestore\_csi](#input\_enable\_filestore\_csi) | The status of the Filestore Container Storage Interface (CSI) driver addon, which allows the usage of filestore instance as volumes. | `bool` | `false` | no |
| <a name="input_enable_ray_operator"></a> [enable\_ray\_operator](#input\_enable\_ray\_operator) | The status of the Ray operator addon. This feature enables Kubernetes APIs for managing and scaling Ray clusters and jobs. You control and are responsible for managing ray.io custom resources in your cluster. This feature is not compatible with GKE clusters that already have another Ray operator installed. Supports clusters on Kubernetes version 1.29.8-gke.1054000 or later. | `bool` | `false` | no |
| <a name="input_enable_gcsfuse_csi"></a> [enable\_gcsfuse\_csi](#input\_enable\_gcsfuse\_csi) | The status of the GCSFuse Filestore Container Storage Interface (CSI) driver addon, which allows the usage of a gcs bucket as volumes. | `bool` | `false` | no |
| <a name="input_enable_master_global_access"></a> [enable\_master\_global\_access](#input\_enable\_master\_global\_access) | Whether the cluster master is accessible globally (from any region) or only within the same region as the private endpoint. | `bool` | `false` | no |
| <a name="input_enable_multi_networking"></a> [enable\_multi\_networking](#input\_enable\_multi\_networking) | Enables [multi networking](https://cloud.google.com/kubernetes-engine/docs/how-to/setup-multinetwork-support-for-pods#create-a-gke-cluster) (Requires GKE Enterprise). This setting is immutable on clusters and enables [Dataplane V2](https://cloud.google.com/kubernetes-engine/docs/concepts/dataplane-v2?hl=en). If null, will determine state based on if additional\_networks are passed in. | `bool` | `null` | no |
3 changes: 3 additions & 0 deletions modules/scheduler/gke-cluster/main.tf
@@ -205,6 +205,9 @@ resource "google_container_cluster" "gke_cluster" {
    parallelstore_csi_driver_config {
      enabled = var.enable_parallelstore_csi
    }
    ray_operator_config {
      enabled = var.enable_ray_operator
    }
  }

  timeouts {
6 changes: 6 additions & 0 deletions modules/scheduler/gke-cluster/variables.tf
@@ -145,6 +145,12 @@ variable "enable_parallelstore_csi" {
  default     = false
}

variable "enable_ray_operator" {
  description = "The status of the Ray operator addon. This feature enables Kubernetes APIs for managing and scaling Ray clusters and jobs. You control and are responsible for managing ray.io custom resources in your cluster. This feature is not compatible with GKE clusters that already have another Ray operator installed. Supports clusters on Kubernetes version 1.29.8-gke.1054000 or later."
  type        = bool
  default     = false
}

variable "enable_dcgm_monitoring" {
  description = "Enable GKE to collect DCGM metrics"
  type        = bool