Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autoscale vtgate replicas using HPA #3945

Closed
wants to merge 11 commits into from
Closed

Conversation

siadat
Copy link
Member

@siadat siadat commented Sep 3, 2024

Problem or Feature

Note: This PR will be unnecessary if the following one is merged: planetscale/vitess-operator#598

We want to auto-scale vtgate using HPA. However, there are 2 problems:

  1. VitessCells are created and reconciled by the vitess-operator. So, if we autoscale number of replicas in a cell, the value will be overwritten by the operator and copied from VitessCluster. Therefore, we need to do one of these things:
    • a. either create cells that are unmanaged by clusters (ie we create VitessCells, instead of having vitess-operator create them from VitessClusters),
    • b. or autoscale replicas in a VitessCluster in some other way
  2. VitessCell (where vtgate is specified) is not scalable with HPA, because it does not define the /scale subresource.

Solution

Therefore, this PR is doing the following:

  1. Create a CR whose only purpose is to autoscale vtgate, call it ScalableVtgate
  2. Create an HPA to auto-scale that ^
  3. When reconciling the VitessCluster, read the number of replicas from ScalableVtgate

Context

@siadat siadat marked this pull request as ready for review September 5, 2024 16:26
Copy link
Member

@nemacysts nemacysts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll also probably want to add a schema for the vitessclusters files at some point (https://github.com/Yelp/paasta/tree/master/paasta_tools/cli/schemas)

Comment on lines +427 to +433
if crd.file_prefix in INSTANCE_TYPE_TO_RELATED_OBJECTS_UPDATER:
INSTANCE_TYPE_TO_RELATED_OBJECTS_UPDATER[crd.file_prefix](
service=service,
instance=inst,
cluster=cluster,
kube_client=kube_client,
soa_dir=DEFAULT_SOA_DIR,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd want a log message right before this as in

log.info(f"Ensuring related API objects for {app} are in sync")
re: updating additional objects

I think I also need to take a closer look at what {setup,cleanup}_kubernetes_cr() currently do and see if there's anything we need to worry about there re: a CRD that's not created with create_custom_resource()

@EvanKrall @Rob-Johnson I'm also a little unsure about what the best way to do this is - I'm not sure if we'd want to have this sort of conditional logic based on the instance type or if we'd want to to something at the base class that all these instance types use so that updating CRs that aren't directly tied to a CRD that paasta discovers

soa_dir: str = DEFAULT_SOA_DIR,
) -> VitessDeploymentConfigDict:
vitess_service_instance_configs = load_vitess_instance_config(
service, instance, cluster, soa_dir=soa_dir
).get_vitess_config()
).get_vitess_config(kube_client=kube_client)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaik, we want to avoid requiring kubernetes permissions/clients to load configs - folks without k8s admin permissions can write scripts that read all instance configs, and this would break such scripts

(also, in general, creating the config dicts should be doable without talking to kubernetes as it should be a pretty pure operation (i.e., the only input should be soaconfigs and there should be no mutations or anything))

Comment on lines +1138 to +1164
for cell in vitess_config["cells"]:
if not self.is_vtgate_autoscaling_enabled(cell):
continue

name = self.get_vtgate_hpa_name(cell["name"])
scalablevtgate_cr = None
try:
scalablevtgate_cr = kube_client.custom.get_namespaced_custom_object(
group=SCALABLEVTGATE_CRD["group"],
version=SCALABLEVTGATE_CRD["version"],
namespace=self.get_namespace(),
plural=SCALABLEVTGATE_CRD["plural"],
name=name,
)
except ApiException as e:
if e.status != 404:
raise e
else:
log.error(f"{SCALABLEVTGATE_CRD['kind']} {name} not found")

if not scalablevtgate_cr:
log.error(f"{SCALABLEVTGATE_CRD['kind']} {name} not found")
continue

autoscaled_replicas = scalablevtgate_cr["spec"].get("replicas")
if autoscaled_replicas is not None:
cell["gateway"]["replicas"] = autoscaled_replicas
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i'm not quite sure where we wanna put this right now such that we can still construct the instance config in the absence of a kube client, but still run this logic when actually talking to k8s to create resources

return

def get_vtgate_hpa_name(self, cell_name: str) -> str:
return f"{sanitised_cr_name(self.service, self.instance)}-cell-{cell_name}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to sanitize cell_name with sanitise_kubernetes_name()? additionally, how long can these strings be? do we need to truncate them to ensure that we stay below the k8s length limits?

> 0
)

if should_exist:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably want to add some log messages here to make it clear when/what changes are happening

],
},
"spec": {
"selector": ",".join([f"{k}={v}" for k, v in pod_selector.items()]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a note: if any of the label values these selectors are filtering on are run through sanitise_kubernetes_name(), we'll need to ensure that the values in the selector dict are also passed to sanitise_kubernetes_name()

@siadat siadat closed this Oct 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants