Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: NodePool should not late-init node count #452

Closed

Conversation

moolen
Copy link

@moolen moolen commented Jan 26, 2024

The nodepool's node_count and initial_node_count should not be late-initialized to prevent a fight with the GKE autoscaler.

Also see discussion on #353

Fixes: #340

I have:

  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

I created a cluster manually and used the following configuration:

apiVersion: gcp.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  projectID: external-secrets-361720
  credentials:
    source: Secret
    secretRef:
      name: provider-secret
      namespace: upbound-system
      key: credentials
---
apiVersion: container.gcp.upbound.io/v1beta1
kind: NodePool
metadata:
  labels:
    test: yep
  name: nodepool
spec:
  providerConfigRef:
    name: default
  managementPolicies: ["*"]
  initProvider:
    initialNodeCount: 1
  forProvider:
    project: external-secrets-361720
    cluster: xp-452
    autoscaling:
    - maxNodeCount: 12 # later, set this too `10` to verify we're able to update nodepool config
      minNodeCount: 1
    location: europe-north1-a
    management:
    - autoRepair: true
      autoUpgrade: true
    maxPodsPerNode: 32
    nodeConfig:
      - machineType: e2-medium

Nodepool was created, note: 12 max node count:
Screenshot from 2024-01-26 16-12-53

after it has been created, some of the late init fields are shown. Both initalNodeCount and nodeCount are not shown, as expected.
Screenshot from 2024-01-26 16-15-08

I scaled up a workload to trigger the GKE autoscaler.
Screenshot from 2024-01-26 16-20-33

Nodes join the cluster (= autoscaling is working)
Screenshot from 2024-01-26 16-20-52

I've set the autoscaling.maxNodeCount=10 to trigger a reconciliation and verify that crossplane is able to update the nodepool despite the nodeCount being changed. Proof: nodepool has changed in GCP console:

Screenshot from 2024-01-26 16-22-50

The nodepool's node_count and initial_node_count should not be
late-initialized to prevent a fight with the GKE autoscaler.

Signed-off-by: Moritz Johner <[email protected]>
@moolen
Copy link
Author

moolen commented Jan 26, 2024

Hey @turkenf could you please ✔️ the PR pipeline? I'd like to get an artifact out of the publish-artifacts job and do end to end testing on it 🙏

//edit thank you 🙇

@turkenf
Copy link
Collaborator

turkenf commented Jan 26, 2024

/test-examples="examples/container/nodepool.yaml"

@moolen
Copy link
Author

moolen commented Jan 26, 2024

damn, i just realized the CI job doesn't build a public accessible image. Going to publish one on my own 👾

@moolen moolen marked this pull request as ready for review January 26, 2024 15:29
@moolen
Copy link
Author

moolen commented Jan 26, 2024

Hey @turkenf i've updated my proof of work. Can you please guide me regarding the Uptest-examples/container/nodepool.yaml failure?

I can find the logs: Is this a flake or an issue with my change 🤔 ?

wait nodepool.container.gcp.upbound.io/nodepool --for=condition=Test --timeout 10s" exceeded 6 sec timeout, context deadline exceeded
    logger.go:42: 12:59:49 | case | Failed to collect events for case in ns kuttl-test-certain-chow: no matches for kind "Event" in version "events.k8s.io/v1beta1"

@turkenf
Copy link
Collaborator

turkenf commented Jan 28, 2024

/test-examples="examples/container/nodepool.yaml"

Copy link
Collaborator

@turkenf turkenf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @moolen,

Thank you for your effort on this PR and your nice testing explanation. If you look at the PR #353, which contains the same change before, we have a managementPolicy feature to be used in these cases. Have you tried this feature? (see this comment)

Hey @turkenf i've updated my proof of work. Can you please guide me regarding the Uptest-examples/container/nodepool.yaml failure?

The default time for testing our examples in Uptest is 20 minutes, and we use the uptest.upbound.io/timeout annotation for resources that need to be tested longer than this time. I think adding an annotation to the example like here will solve the issue.

@moolen
Copy link
Author

moolen commented Feb 1, 2024

Thank you @turkenf, i got it to work with bespoke managementPolicy. I'm going to close this PR then.

@moolen moolen closed this Feb 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NodePool container.gcp.upbound.io should not LateInitialize node_count
2 participants