Feature: Add ability to drain nodes #249
Comments
I could see this being a resource that takes in the node ID (either read out of the started VM, or provisioned into the VM using a random_uuid resource) and that could optionally wait for the node to come online. Technically it wouldn't be provisioning the node (the node registers itself, after all), but it would validate that the node came up properly, which is an improvement on current workflows, which have no way of checking that. Then on destroy it drains the node, marks it ineligible, and waits for the drain to complete.
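For illustration, a minimal sketch of the UUID-injection approach described above, assuming the node ID can be pre-seeded by writing it into the Nomad client's data directory before the agent starts; the file path, AMI variable, and cloud-init mechanism are assumptions, not anything the provider dictates:

```hcl
# Hypothetical sketch: pre-generate the Nomad node ID in Terraform and inject
# it into the VM so the node can later be matched by that ID.
resource "random_uuid" "nomad_node_id" {}

resource "aws_instance" "nomad_client_1" {
  ami           = var.nomad_client_ami # assumed variable
  instance_type = "t3.medium"

  # Assumption: cloud-init writes this file before the Nomad client starts,
  # so the agent comes up with a known node ID.
  user_data = <<-EOT
    #cloud-config
    write_files:
      - path: /opt/nomad/data/client/node-id
        content: ${random_uuid.nomad_node_id.result}
  EOT
}
```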
Hi @nvx 👋 This is an interesting idea, but I'm not sure it fits well with the Terraform model 🤔 For example, what would it mean to delete this resource? We can't really "undo" a node drain, so the resource would kind of just hang there? These types of one-off operations seem better suited for the Nomad CLI.
Marking a node as eligible or ineligible for scheduling is a reversible operation; I'd envisage triggering a drain as an option when marking it ineligible. While the Nomad CLI can be used for this, it means the lifecycle of a Nomad worker can no longer be maintained through Terraform alone: you'd need to drain a worker manually before having Terraform do the rest of the lifecycle management, which gets onerous when doing immutable blue/green deployments where destroying resources is a business-as-usual activity. We've tried to emulate it using destroy-time exec provisioners that call the Nomad CLI, but it gets super ugly trying to access credentials to perform the operation (our Nomad tokens are read out of Vault and are short lived, as per Terraform best practices to reduce the risk of secrets in the state file).
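For context, a rough sketch of the kind of destroy-time provisioner workaround being described; the node ID reference is hypothetical, and the credential handling is exactly the ugly part, since destroy provisioners can only reference `self`:

```hcl
# Rough illustration of the destroy-time exec provisioner workaround.
# Destroy provisioners may only reference self.*, so the node ID (and anything
# else they need) has to be stashed in triggers, which lands in the state file.
resource "null_resource" "drain_on_destroy" {
  triggers = {
    node_id = random_uuid.nomad_node_id.result # hypothetical reference
  }

  provisioner "local-exec" {
    when    = destroy
    command = "nomad node drain -enable -deadline 10m -yes ${self.triggers.node_id}"
    # NOMAD_ADDR / NOMAD_TOKEN must already be present in the environment;
    # there is no clean way to fetch a fresh short-lived token from Vault here.
  }
}
```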
That's a different endpoint, so probably another resource altogether 🙂 Changing eligibility on delete seems error-prone, so I think a no-op would be better. Would you be interested in working on a PR for them?
I just had a closer look at what the Nomad CLI is doing (to date I've been draining nodes manually). It looks like the logic is: enabling drain on a node sets its eligibility to false by default (but you can also set eligibility to false without draining), and cancelling an existing drain changes the eligibility back to true by default (but you can also do this separately from the drain, or cancel a running drain without changing the eligibility back). I guess in my head I had conflated the drain and eligibility flags since normally I'm dealing with them together (indeed, I don't think you can start a drain on a host without also marking it ineligible at the same time).
I'm not quite sure I follow what you mean? When I'm provisioning a Nomad worker with Terraform, I'm currently generating a UUID in Terraform and injecting it into the VM to be the Nomad worker UUID. Having a resource that depends on the VM resource, talks to Nomad, and verifies the worker has come up and is marked as eligible (which should be the default for a newly joined agent) is how I see it working. Then on destroy it would mark the worker ineligible and trigger a drain; once the drain completes the resource would be considered destroyed, allowing the next step in the destroy plan to proceed (e.g. destroying the VM). I guess I see it more as a resource that manages the lifecycle of the Nomad worker, where draining on destroy is an optional parameter. The work during create isn't just a no-op either, since it would check that a Nomad agent came up with the provided UUID, which verifies that the VM came up properly even if the amount of actual configuration it needs to do is minimal.
Definitely, if we can figure out a pattern for how to achieve the lifecycle in Terraform, I'd be glad to make a PR implementing it.
Ahh, now I see what you mean, thanks! Usually resources in Terraform providers match 1:1 with a set of remote API CRUD operations, but these endpoints (like drain and eligibility) don't map cleanly onto that model.

Your idea is that this resource would be like a sidecar to another resource, like an AWS EC2 instance, so it would need a way to connect the two. You mentioned that you're generating UUIDs, but you will probably need an explicit way in the resource's configuration to identify the node.

If I understood correctly, this is an example of how this resource would be used:

```hcl
resource "aws_instance" "nomad_client_1" {
  # ...
}

resource "nomad_node_manager" "nomad_client_1" {
  drain_on_destroy         = true
  drain_deadline           = "10m"
  drain_ignore_system_jobs = false

  drain_meta = {
    triggered_by = "Terraform"
  }

  eligible = true

  filter {
    type   = "id"
    values = [ ... ]
  }

  filter {
    type   = "meta"
    key    = "unique.platform.aws.instance-id"
    values = [aws_instance.nomad_client_1.id]
  }
}
```
On create, this resource would wait until all of the provided filters match a registered node in the Nomad cluster.

On read, the resource would just check the IDs in the Terraform state to see if the nodes are still registered.

On update, the filters must be checked again, so it's very similar to create, but it would also check the `eligible` value and update the node's scheduling eligibility accordingly.

On destroy, the resource checks the values of `drain_on_destroy` and the other `drain_*` attributes and, if enabled, drains the node and waits for the drain to complete.

How does this sound? Is it similar to what you had in mind?
Yup, that's exactly it. I like your idea of using the node metadata to filter on as well; it would be handy where the cloud platform provides that information, and it's a little more flexible than having to explicitly provision the node ID ahead of time. A timeout for how long to wait on create for the node to appear before considering it failed might be worth adding too.
There's a standard way to handle timeouts in Terraform: resources can expose a `timeouts` block that users configure per operation. So it's mostly a matter of making sure the new resource supports it. Would you still be interested in submitting a PR with this spec?
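For illustration only, here is how the standard `timeouts` block might look on the hypothetical nomad_node_manager resource sketched earlier (none of this exists in the provider today):

```hcl
resource "nomad_node_manager" "nomad_client_1" {
  # ... filters and drain settings as in the earlier example ...

  # Standard Terraform timeouts block: how long to wait for the node to
  # register on create, and for the drain to finish on destroy.
  timeouts {
    create = "15m"
    delete = "30m"
  }
}
```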
Good to know!
Definitely. Just a matter of getting some free time to look at it. This time of year ends up being rather hectic, unfortunately.
No worries! We'll be around to help whenever you're ready to work on it 🙂
When provisioning Nomad client nodes with Terraform, it would be useful to be able to use the Nomad Terraform provider to gracefully drain a node (including waiting for the drain to complete) before destroying the VM.
I've gotten close by using a destroy-time exec provisioner in a null_resource, but the problem is that it isn't possible to access a fresh Nomad ACL token from there. Having this as a feature of the provider fixes that issue (and reduces the brittleness inherent to exec provisioners), since providers can reference credentials directly from data sources that are refreshed during the plan.
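A hedged sketch of that provider-level credential pattern, assuming a Vault Nomad secrets backend and the Nomad provider's address/secret_id arguments; the role name and address are placeholders:

```hcl
# Illustrative only: read a short-lived Nomad token from Vault at plan/apply
# time and pass it to the Nomad provider, instead of threading credentials
# through an exec provisioner.
data "vault_generic_secret" "nomad_token" {
  path = "nomad/creds/terraform" # hypothetical Vault Nomad secrets role
}

provider "nomad" {
  address   = "https://nomad.example.com:4646" # placeholder address
  secret_id = data.vault_generic_secret.nomad_token.data["secret_id"]
}
```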