fix: NodePool should not late-init node count #452

moolen · 2024-01-26T11:52:23Z

The nodepool's node_count and initial_node_count should not be late-initialized to prevent a fight with the GKE autoscaler.
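To illustrate the problem, here is a rough sketch of what late initialization would do to a NodePool spec. The values are hypothetical and not taken from a real cluster; the point is only that an observed value ends up as desired state.

```yaml
spec:
  forProvider:
    autoscaling:
    - maxNodeCount: 12
      minNodeCount: 1
    # Hypothetical late-initialized value: not set by the user, but copied
    # from the observed state. Once present, it is reconciled as desired
    # state, so node counts chosen by the GKE autoscaler would be reverted.
    nodeCount: 1
```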

Also see discussion on #353

Fixes: #340

I have:

Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

I created a cluster manually and used the following configuration:

apiVersion: gcp.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  projectID: external-secrets-361720
  credentials:
    source: Secret
    secretRef:
      name: provider-secret
      namespace: upbound-system
      key: credentials
---
apiVersion: container.gcp.upbound.io/v1beta1
kind: NodePool
metadata:
  labels:
    test: yep
  name: nodepool
spec:
  providerConfigRef:
    name: default
  managementPolicies: ["*"]
  initProvider:
    initialNodeCount: 1
  forProvider:
    project: external-secrets-361720
    cluster: xp-452
    autoscaling:
    - maxNodeCount: 12 # later, set this to `10` to verify we're able to update nodepool config
      minNodeCount: 1
    location: europe-north1-a
    management:
    - autoRepair: true
      autoUpgrade: true
    maxPodsPerNode: 32
    nodeConfig:
      - machineType: e2-medium

The NodePool was created; note the max node count of 12.

After it has been created, some of the late-initialized fields are shown. Neither initialNodeCount nor nodeCount is shown, as expected.

I scaled up a workload to trigger the GKE autoscaler.

Nodes join the cluster (= autoscaling is working)

I then set autoscaling.maxNodeCount to 10 to trigger a reconciliation and verify that Crossplane can still update the nodepool even though nodeCount has been changed by the autoscaler. Proof: the nodepool has changed in the GCP console.
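The update in this last step amounts to a one-field change in the manifest above. A sketch of the relevant fragment, with every other field assumed to stay as in the original configuration:

```yaml
spec:
  forProvider:
    autoscaling:
    - maxNodeCount: 10 # lowered from 12 to trigger a reconciliation
      minNodeCount: 1
```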

The nodepool's node_count and initial_node_count should not be late-initialized to prevent a fight with the GKE autoscaler.

Signed-off-by: Moritz Johner <[email protected]>

moolen · 2024-01-26T11:55:46Z

Hey @turkenf could you please ✔️ the PR pipeline? I'd like to get an artifact out of the publish-artifacts job and do end to end testing on it 🙏

//edit thank you 🙇

turkenf · 2024-01-26T12:34:34Z

/test-examples="examples/container/nodepool.yaml"

moolen · 2024-01-26T12:49:07Z

damn, i just realized the CI job doesn't build a publicly accessible image. Going to publish one on my own 👾

moolen · 2024-01-26T15:31:30Z

Hey @turkenf i've updated my proof of work. Can you please guide me regarding the Uptest-examples/container/nodepool.yaml failure?

I can find the logs: Is this a flake or an issue with my change 🤔 ?

wait nodepool.container.gcp.upbound.io/nodepool --for=condition=Test --timeout 10s" exceeded 6 sec timeout, context deadline exceeded
    logger.go:42: 12:59:49 | case | Failed to collect events for case in ns kuttl-test-certain-chow: no matches for kind "Event" in version "events.k8s.io/v1beta1"

turkenf · 2024-01-28T16:40:59Z

/test-examples="examples/container/nodepool.yaml"

turkenf

Hi @moolen,

Thank you for your effort on this PR and for the clear testing write-up. PR #353 proposed the same change earlier, and for these cases we have the managementPolicies feature. Have you tried it? (see this comment)

> Hey @turkenf i've updated my proof of work. Can you please guide me regarding the Uptest-examples/container/nodepool.yaml failure?

The default time for testing our examples in Uptest is 20 minutes, and we use the uptest.upbound.io/timeout annotation for resources that need to be tested longer than this time. I think adding an annotation to the example like here will solve the issue.
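As a rough sketch of that suggestion, the annotation would sit under the example's metadata. The value below is an assumption, since the comment does not name one; it is written on the understanding that Uptest reads it as a number of seconds, which should be checked against the Uptest documentation.

```yaml
apiVersion: container.gcp.upbound.io/v1beta1
kind: NodePool
metadata:
  annotations:
    # Assumed value: 3600 seconds (one hour). The thread does not give a number.
    uptest.upbound.io/timeout: "3600"
  name: nodepool
# spec unchanged, omitted here
```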

moolen · 2024-02-01T12:54:08Z

Thank you @turkenf, i got it to work with bespoke managementPolicy. I'm going to close this PR then.
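The thread does not show which policies were used. One plausible shape, assuming the goal is to switch off late initialization for the whole resource, is to list every Crossplane management action except LateInitialize and keep initialNodeCount under initProvider:

```yaml
apiVersion: container.gcp.upbound.io/v1beta1
kind: NodePool
metadata:
  name: nodepool
spec:
  providerConfigRef:
    name: default
  # Assumption: everything except LateInitialize, so observed values such as
  # nodeCount are not copied back into spec.forProvider.
  managementPolicies: ["Observe", "Create", "Update", "Delete"]
  initProvider:
    initialNodeCount: 1 # used at creation only
  forProvider:
    project: external-secrets-361720
    cluster: xp-452
    location: europe-north1-a
    autoscaling:
    - maxNodeCount: 10
      minNodeCount: 1
```

With this shape the provider code needs no change, which matches the decision to close the PR.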

fix: NodePool should not late-init node count

6819b2b


moolen marked this pull request as ready for review January 26, 2024 15:29

moolen requested review from ulucinar, sergenyalcin and turkenf as code owners January 26, 2024 15:29

turkenf reviewed Jan 28, 2024


moolen closed this Feb 1, 2024
