add future reservation support
abbas1902 committed Nov 11, 2024
1 parent b857af0 commit c88c686
Showing 11 changed files with 155 additions and 42 deletions.
@@ -178,6 +178,7 @@ No modules.
| <a name="input_enable_shielded_vm"></a> [enable\_shielded\_vm](#input\_enable\_shielded\_vm) | Enable the Shielded VM configuration. Note: the instance image must support option. | `bool` | `false` | no |
| <a name="input_enable_smt"></a> [enable\_smt](#input\_enable\_smt) | Enables Simultaneous Multi-Threading (SMT) on instance. | `bool` | `false` | no |
| <a name="input_enable_spot_vm"></a> [enable\_spot\_vm](#input\_enable\_spot\_vm) | Enable the partition to use spot VMs (https://cloud.google.com/spot-vms). | `bool` | `false` | no |
| <a name="input_future_reservation"></a> [future\_reservation](#input\_future\_reservation) | Allows for the use of future reservations. Input can either be the future reservation name or full selfLink.<br/>See https://cloud.google.com/compute/docs/instances/future-reservations-overview | `string` | `null` | no |
| <a name="input_guest_accelerator"></a> [guest\_accelerator](#input\_guest\_accelerator) | List of the type and count of accelerator cards attached to the instance. | <pre>list(object({<br/> type = string,<br/> count = number<br/> }))</pre> | `[]` | no |
| <a name="input_instance_image"></a> [instance\_image](#input\_instance\_image) | Defines the image that will be used in the Slurm node group VM instances.<br/><br/>Expected Fields:<br/>name: The name of the image. Mutually exclusive with family.<br/>family: The image family to use. Mutually exclusive with name.<br/>project: The project where the image is hosted.<br/><br/>For more information on creating custom images that comply with Slurm on GCP<br/>see the "Slurm on GCP Custom Images" section in docs/vm-images.md. | `map(string)` | <pre>{<br/> "family": "slurm-gcp-6-8-hpc-rocky-linux-8",<br/> "project": "schedmd-slurm-public"<br/>}</pre> | no |
| <a name="input_instance_image_custom"></a> [instance\_image\_custom](#input\_instance\_image\_custom) | A flag that designates that the user is aware that they are requesting<br/>to use a custom and potentially incompatible image for this Slurm on<br/>GCP module.<br/><br/>If the field is set to false, only the compatible families and project<br/>names will be accepted. The deployment will fail with any other image<br/>family or name. If set to true, no checks will be done.<br/><br/>See: https://goo.gle/hpc-slurm-images | `bool` | `false` | no |
@@ -95,6 +95,7 @@ locals {
spot = var.enable_spot_vm
termination_action = try(var.spot_instance_config.termination_action, null)
reservation_name = local.reservation_name
future_reservation = var.future_reservation
maintenance_interval = var.maintenance_interval
instance_properties_json = jsonencode(var.instance_properties)

15 changes: 15 additions & 0 deletions community/modules/compute/schedmd-slurm-gcp-v6-nodeset/outputs.tf
@@ -54,4 +54,19 @@ output "nodeset" {
condition = !var.enable_placement || !var.dws_flex.enabled
error_message = "Cannot use DWS Flex with `enable_placement`."
}

precondition {
condition = var.reservation_name == "" || var.future_reservation == null
error_message = "Cannot use reservations and future reservations in the same nodeset"
}

precondition {
condition = var.node_count_dynamic_max == 0 || var.future_reservation == null
error_message = "Only static nodes can be used with future reservations"
}

precondition {
condition = !var.enable_placement || var.future_reservation == null
error_message = "Cannot use `enable_placement` with future reservations."
}
}
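
The three preconditions above all share one shape: each passes unless `future_reservation` is set together with a conflicting option. As a rough illustration only, here is the same logic as a Python sketch. The function name `nodeset_precondition_errors` is hypothetical and is not part of the commit; the conditions and messages are taken from the Terraform above.

```python
from typing import List, Optional


def nodeset_precondition_errors(
    reservation_name: str,
    future_reservation: Optional[str],
    node_count_dynamic_max: int,
    enable_placement: bool,
) -> List[str]:
    """Return the error messages of the preconditions that would fail.

    Hypothetical helper: mirrors the three Terraform preconditions, where
    None stands in for Terraform's null.
    """
    errors: List[str] = []
    uses_future = future_reservation is not None

    # Terraform: var.reservation_name == "" || var.future_reservation == null
    if reservation_name != "" and uses_future:
        errors.append("Cannot use reservations and future reservations in the same nodeset")

    # Terraform: var.node_count_dynamic_max == 0 || var.future_reservation == null
    if node_count_dynamic_max != 0 and uses_future:
        errors.append("Only static nodes can be used with future reservations")

    # Terraform: !var.enable_placement || var.future_reservation == null
    if enable_placement and uses_future:
        errors.append("Cannot use `enable_placement` with future reservations.")

    return errors
```

With `future_reservation` left as `None`, no combination of the other inputs produces an error from these three checks, which matches the `|| var.future_reservation == null` clause in each condition.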
@@ -463,6 +463,15 @@ variable "reservation_name" {
}
}

variable "future_reservation" {
description = <<-EOD
Allows for the use of future reservations. Input can either be the future reservation name or full selfLink.
See https://cloud.google.com/compute/docs/instances/future-reservations-overview
EOD
type = string
default = null
}
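
The description says the input may be either a bare future reservation name or a full selfLink. The commit does not show how the two forms are told apart, so the following is only a sketch under an assumption: that a selfLink ends in `projects/<project>/zones/<zone>/futureReservations/<name>`. The names `parse_future_reservation` and `FutureReservationRef` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed selfLink suffix; not taken from the commit.
_SELF_LINK = re.compile(
    r"projects/(?P<project>[^/]+)/zones/(?P<zone>[^/]+)/futureReservations/(?P<name>[^/]+)$"
)


class FutureReservationRef(NamedTuple):
    name: str
    project: Optional[str]  # None when only a bare name was given
    zone: Optional[str]     # None when only a bare name was given


def parse_future_reservation(value: str) -> FutureReservationRef:
    """Split a selfLink into its parts, or treat the value as a bare name."""
    match = _SELF_LINK.search(value)
    if match:
        return FutureReservationRef(match["name"], match["project"], match["zone"])
    return FutureReservationRef(value, None, None)
```

A bare name leaves project and zone unset, so a caller would have to supply them from elsewhere, for example from the nodeset's allowed zones.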

variable "maintenance_interval" {
description = <<-EOD
Sets the maintenance interval for instances in this nodeset.
@@ -313,7 +313,7 @@ limitations under the License.
| <a name="input_metadata"></a> [metadata](#input\_metadata) | Metadata, provided as a map. | `map(string)` | `{}` | no |
| <a name="input_min_cpu_platform"></a> [min\_cpu\_platform](#input\_min\_cpu\_platform) | Specifies a minimum CPU platform. Applicable values are the friendly names of<br/>CPU platforms, such as Intel Haswell or Intel Skylake. See the complete list:<br/>https://cloud.google.com/compute/docs/instances/specify-min-cpu-platform | `string` | `null` | no |
| <a name="input_network_storage"></a> [network\_storage](#input\_network\_storage) | An array of network attached storage mounts to be configured on all instances. | <pre>list(object({<br/> server_ip = string,<br/> remote_mount = string,<br/> local_mount = string,<br/> fs_type = string,<br/> mount_options = string,<br/> client_install_runner = optional(map(string))<br/> mount_runner = optional(map(string))<br/> }))</pre> | `[]` | no |
| <a name="input_nodeset"></a> [nodeset](#input\_nodeset) | Define nodesets, as a list. | <pre>list(object({<br/> node_count_static = optional(number, 0)<br/> node_count_dynamic_max = optional(number, 1)<br/> node_conf = optional(map(string), {})<br/> nodeset_name = string<br/> additional_disks = optional(list(object({<br/> disk_name = optional(string)<br/> device_name = optional(string)<br/> disk_size_gb = optional(number)<br/> disk_type = optional(string)<br/> disk_labels = optional(map(string), {})<br/> auto_delete = optional(bool, true)<br/> boot = optional(bool, false)<br/> })), [])<br/> bandwidth_tier = optional(string, "platform_default")<br/> can_ip_forward = optional(bool, false)<br/> disable_smt = optional(bool, false)<br/> disk_auto_delete = optional(bool, true)<br/> disk_labels = optional(map(string), {})<br/> disk_size_gb = optional(number)<br/> disk_type = optional(string)<br/> enable_confidential_vm = optional(bool, false)<br/> enable_placement = optional(bool, false)<br/> enable_oslogin = optional(bool, true)<br/> enable_shielded_vm = optional(bool, false)<br/> enable_maintenance_reservation = optional(bool, true)<br/> gpu = optional(object({<br/> count = number<br/> type = string<br/> }))<br/> dws_flex = object({<br/> enabled = bool<br/> max_run_duration = number<br/> use_job_duration = bool<br/> })<br/> labels = optional(map(string), {})<br/> machine_type = optional(string)<br/> maintenance_interval = optional(string)<br/> instance_properties_json = string<br/> metadata = optional(map(string), {})<br/> min_cpu_platform = optional(string)<br/> network_tier = optional(string, "STANDARD")<br/> network_storage = optional(list(object({<br/> server_ip = string<br/> remote_mount = string<br/> local_mount = string<br/> fs_type = string<br/> mount_options = string<br/> client_install_runner = optional(map(string))<br/> mount_runner = optional(map(string))<br/> })), [])<br/> on_host_maintenance = optional(string)<br/> preemptible = optional(bool, 
false)<br/> region = optional(string)<br/> service_account = optional(object({<br/> email = optional(string)<br/> scopes = optional(list(string), ["https://www.googleapis.com/auth/cloud-platform"])<br/> }))<br/> shielded_instance_config = optional(object({<br/> enable_integrity_monitoring = optional(bool, true)<br/> enable_secure_boot = optional(bool, true)<br/> enable_vtpm = optional(bool, true)<br/> }))<br/> source_image_family = optional(string)<br/> source_image_project = optional(string)<br/> source_image = optional(string)<br/> subnetwork_self_link = string<br/> additional_networks = optional(list(object({<br/> network = string<br/> subnetwork = string<br/> subnetwork_project = string<br/> network_ip = string<br/> nic_type = string<br/> stack_type = string<br/> queue_count = number<br/> access_config = list(object({<br/> nat_ip = string<br/> network_tier = string<br/> }))<br/> ipv6_access_config = list(object({<br/> network_tier = string<br/> }))<br/> alias_ip_range = list(object({<br/> ip_cidr_range = string<br/> subnetwork_range_name = string<br/> }))<br/> })))<br/> access_config = optional(list(object({<br/> nat_ip = string<br/> network_tier = string<br/> })))<br/> spot = optional(bool, false)<br/> tags = optional(list(string), [])<br/> termination_action = optional(string)<br/> reservation_name = optional(string)<br/> startup_script = optional(list(object({<br/> filename = string<br/> content = string })), [])<br/><br/> zone_target_shape = string<br/> zone_policy_allow = set(string)<br/> zone_policy_deny = set(string)<br/> }))</pre> | `[]` | no |
| <a name="input_nodeset"></a> [nodeset](#input\_nodeset) | Define nodesets, as a list. | <pre>list(object({<br/> node_count_static = optional(number, 0)<br/> node_count_dynamic_max = optional(number, 1)<br/> node_conf = optional(map(string), {})<br/> nodeset_name = string<br/> additional_disks = optional(list(object({<br/> disk_name = optional(string)<br/> device_name = optional(string)<br/> disk_size_gb = optional(number)<br/> disk_type = optional(string)<br/> disk_labels = optional(map(string), {})<br/> auto_delete = optional(bool, true)<br/> boot = optional(bool, false)<br/> })), [])<br/> bandwidth_tier = optional(string, "platform_default")<br/> can_ip_forward = optional(bool, false)<br/> disable_smt = optional(bool, false)<br/> disk_auto_delete = optional(bool, true)<br/> disk_labels = optional(map(string), {})<br/> disk_size_gb = optional(number)<br/> disk_type = optional(string)<br/> enable_confidential_vm = optional(bool, false)<br/> enable_placement = optional(bool, false)<br/> enable_oslogin = optional(bool, true)<br/> enable_shielded_vm = optional(bool, false)<br/> enable_maintenance_reservation = optional(bool, true)<br/> gpu = optional(object({<br/> count = number<br/> type = string<br/> }))<br/> dws_flex = object({<br/> enabled = bool<br/> max_run_duration = number<br/> use_job_duration = bool<br/> })<br/> labels = optional(map(string), {})<br/> machine_type = optional(string)<br/> maintenance_interval = optional(string)<br/> instance_properties_json = string<br/> metadata = optional(map(string), {})<br/> min_cpu_platform = optional(string)<br/> network_tier = optional(string, "STANDARD")<br/> network_storage = optional(list(object({<br/> server_ip = string<br/> remote_mount = string<br/> local_mount = string<br/> fs_type = string<br/> mount_options = string<br/> client_install_runner = optional(map(string))<br/> mount_runner = optional(map(string))<br/> })), [])<br/> on_host_maintenance = optional(string)<br/> preemptible = optional(bool, 
false)<br/> region = optional(string)<br/> service_account = optional(object({<br/> email = optional(string)<br/> scopes = optional(list(string), ["https://www.googleapis.com/auth/cloud-platform"])<br/> }))<br/> shielded_instance_config = optional(object({<br/> enable_integrity_monitoring = optional(bool, true)<br/> enable_secure_boot = optional(bool, true)<br/> enable_vtpm = optional(bool, true)<br/> }))<br/> source_image_family = optional(string)<br/> source_image_project = optional(string)<br/> source_image = optional(string)<br/> subnetwork_self_link = string<br/> additional_networks = optional(list(object({<br/> network = string<br/> subnetwork = string<br/> subnetwork_project = string<br/> network_ip = string<br/> nic_type = string<br/> stack_type = string<br/> queue_count = number<br/> access_config = list(object({<br/> nat_ip = string<br/> network_tier = string<br/> }))<br/> ipv6_access_config = list(object({<br/> network_tier = string<br/> }))<br/> alias_ip_range = list(object({<br/> ip_cidr_range = string<br/> subnetwork_range_name = string<br/> }))<br/> })))<br/> access_config = optional(list(object({<br/> nat_ip = string<br/> network_tier = string<br/> })))<br/> spot = optional(bool, false)<br/> tags = optional(list(string), [])<br/> termination_action = optional(string)<br/> reservation_name = optional(string)<br/> future_reservation = string<br/> startup_script = optional(list(object({<br/> filename = string<br/> content = string })), [])<br/><br/> zone_target_shape = string<br/> zone_policy_allow = set(string)<br/> zone_policy_deny = set(string)<br/> }))</pre> | `[]` | no |
| <a name="input_nodeset_dyn"></a> [nodeset\_dyn](#input\_nodeset\_dyn) | Defines dynamic nodesets, as a list. | <pre>list(object({<br/> nodeset_name = string<br/> nodeset_feature = string<br/> }))</pre> | `[]` | no |
| <a name="input_nodeset_tpu"></a> [nodeset\_tpu](#input\_nodeset\_tpu) | Define TPU nodesets, as a list. | <pre>list(object({<br/> node_count_static = optional(number, 0)<br/> node_count_dynamic_max = optional(number, 5)<br/> nodeset_name = string<br/> enable_public_ip = optional(bool, false)<br/> node_type = string<br/> accelerator_config = optional(object({<br/> topology = string<br/> version = string<br/> }), {<br/> topology = ""<br/> version = ""<br/> })<br/> tf_version = string<br/> preemptible = optional(bool, false)<br/> preserve_tpu = optional(bool, false)<br/> zone = string<br/> data_disks = optional(list(string), [])<br/> docker_image = optional(string, "")<br/> network_storage = optional(list(object({<br/> server_ip = string<br/> remote_mount = string<br/> local_mount = string<br/> fs_type = string<br/> mount_options = string<br/> client_install_runner = optional(map(string))<br/> mount_runner = optional(map(string))<br/> })), [])<br/> subnetwork = string<br/> service_account = optional(object({<br/> email = optional(string)<br/> scopes = optional(list(string), ["https://www.googleapis.com/auth/cloud-platform"])<br/> }))<br/> project_id = string<br/> reserved = optional(string, false)<br/> }))</pre> | `[]` | no |
| <a name="input_on_host_maintenance"></a> [on\_host\_maintenance](#input\_on\_host\_maintenance) | Instance availability Policy. | `string` | `"MIGRATE"` | no |
@@ -77,24 +77,11 @@ def instance_properties(nodeset:object, model:str, placement_group:Optional[str]
props.scheduling.onHostMaintenance = "TERMINATE"
props.resourcePolicies = [placement_group]

-    if reservation := lookup().nodeset_reservation(nodeset):
-        props.reservationAffinity = {
-            "consumeReservationType": "SPECIFIC_RESERVATION",
-            "key": f"compute.{util.universe_domain()}/reservation-name",
-            "values": [reservation.bulk_insert_name],
-        }
+    if reservation := lookup().nodeset_reservation(nodeset.reservation_name,nodeset.zone_policy_allow):
+        update_reservation_props(reservation, props)

-        if reservation.policies:
-            props.scheduling.onHostMaintenance = "TERMINATE"
-            props.resourcePolicies = reservation.policies
-            log.info(
-                f"reservation {reservation.bulk_insert_name} is being used with policies {props.resourcePolicies}"
-            )
-        else:
-            props.resourcePolicies = []
-            log.info(
-                f"reservation {reservation.bulk_insert_name} is being used without any policies"
-            )
+    if nodeset.future_reservation:
+        use_future_reservation(props, nodeset)

if nodeset.maintenance_interval:
props.scheduling.maintenanceInterval = nodeset.maintenance_interval
@@ -104,9 +91,35 @@ def instance_properties(nodeset:object, model:str, placement_group:Optional[str]

# Override with properties explicit specified in the nodeset
props.update(nodeset.get("instance_properties") or {})

return props

def update_reservation_props(reservation:Optional[object], props:object) -> None:
    if not reservation:
        return

    props.reservationAffinity = {
        "consumeReservationType": "SPECIFIC_RESERVATION",
        "key": f"compute.{util.universe_domain()}/reservation-name",
        "values": [reservation.bulk_insert_name],
    }

    if reservation.policies:
        props.scheduling.onHostMaintenance = "TERMINATE"
        props.resourcePolicies = reservation.policies
        log.info(
            f"reservation {reservation.bulk_insert_name} is being used with policies {props.resourcePolicies}"
        )
    else:
        props.resourcePolicies = []
        log.info(
            f"reservation {reservation.bulk_insert_name} is being used without any policies"
        )

def use_future_reservation(props:object, nodeset:object) -> None:
    if (future_reservation := lookup().future_reservation(nodeset.future_reservation, nodeset.zone_policy_allow)) and future_reservation.nodesAreActive and future_reservation.specificReservationRequired:
        update_reservation_props(future_reservation.activeReservation, props)
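
`use_future_reservation` applies reservation properties only when three things hold at once: the lookup returns something, its nodes are active, and a specific reservation is required. A minimal sketch of that gating in isolation follows. The `FutureReservationStub` class and `reservation_to_apply` function are hypothetical stand-ins; only the attribute names come from the diff, and the real object returned by `lookup().future_reservation(...)` is not shown in this commit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FutureReservationStub:
    # Attribute names mirror those read in the diff; the class is hypothetical.
    nodesAreActive: bool
    specificReservationRequired: bool
    activeReservation: Optional[str] = None


def reservation_to_apply(fr: Optional[FutureReservationStub]) -> Optional[str]:
    """Return the active reservation only when every gating condition holds."""
    if fr and fr.nodesAreActive and fr.specificReservationRequired:
        return fr.activeReservation
    # Lookup failed, nodes not yet active, or no specific reservation required.
    return None
```

Even when all three conditions hold, `activeReservation` may still be empty; in the diff that case is covered by the early `return` at the top of `update_reservation_props`.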

def update_props_dws(props:object, dws_flex:object, job_id: Optional[int]) -> None:
props.scheduling.onHostMaintenance = "TERMINATE"
props.scheduling.instanceTerminationAction = "DELETE"
@@ -122,7 +135,6 @@ def dws_flex_duration(dws_flex:object, job_id: Optional[int]) -> int:
log.info("Job TimeLimit cannot be less than 30 seconds or exceed 2 weeks")
return max_duration


def per_instance_properties(node):
props = NSDict()
# No properties beyond name are supported yet.