node-group-auto-discovery support for oci #7403

Open
wants to merge 6 commits into
base: master

Contributor

@gvnc gvnc commented Oct 16, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds OCI support for the node-group-auto-discovery parameter. Cluster Autoscaler looks for the nodepools in the given compartment and matches their tags against the configured ones. If the tags match, the nodepool is used for autoscaling. If they do not match, the nodepool is ignored.

Which issue(s) this PR fixes:

Special notes for your reviewer:

This functionality was tested on OCI. Extended logs from the test follow.


I1016 10:02:11.736204       1 oci_manager.go:340] node group auto discovery spec constructed: &{manager:<nil> kubeClient:<nil> clusterId:ocid1.clusterinteg.oc1.phx.aaaaaaaagju3g2ukus57t4spw7u4fcmdiozwzgamnzq46fdozcv1234567 compartmentId:ocid1.compartment.oc1..aaaaaaaacciywjzae6gctocqzgiah6go4qay2phl2aoepwq6kv42xratkadq tags:map[foo:bar nmsp.ca-managed:true] minSize:1 maxSize:5}

W1016 10:02:12.921044       1 oci_manager.go:225] nodepool ignored as the tags do not satisfy the requirement : ocid1.nodepoolinteg.oc1.phx.aaaaaaaaunlcuncyrpqm7u6x6gw7ihycwkybiqddsug2wqrsinktn3dnllda , map[tag1:1234]

I1016 10:02:12.921103       1 oci_manager.go:223] auto discovered nodepool in compartment : ocid1.compartment.oc1..aaaaaaaacciywjzae6gctocqzgiah6go4qay2phl2aoepwq6kv42xratkadq , nodepoolid: ocid1.nodepoolinteg.oc1.phx.aaaaaaaaxq7iksc5y5hbpahfcdr3xjbif2mfzf7n45sbzec32nc2tlax4hza

W1016 10:02:12.921153       1 oci_manager.go:225] nodepool ignored as the  tags do not satisfy the requirement : ocid1.nodepoolinteg.oc1.phx.aaaaaaaahtj2m647kg7oqtb6rpeckaj7olvuyzl5kqejumbsmn2ly6w7rfbq , map[testTag:testValue]

Does this PR introduce a user-facing change?

Added OCI support for the **node-group-auto-discovery** parameter.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [Usage]: The parameter value should follow the pattern:
- `clusterId:<clusterId>,compartmentId:<compartmentId>,nodepoolTags:<tagKey1>=<tagValue1>&<tagKey2>=<tagValue2>,min:<min>,max:<max>`
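The pattern above can be read as a comma-separated list of `key:value` fields, with the tags encoded as `&`-separated `key=value` pairs. The following is a minimal sketch of a parser for that format, not the code from this PR: the type `discoverySpec`, the function `parseDiscoverySpec`, and the choice to treat `clusterId`, `compartmentId`, and `nodepoolTags` as required are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// discoverySpec is a hypothetical holder for the parsed fields.
type discoverySpec struct {
	ClusterID     string
	CompartmentID string
	Tags          map[string]string
	Min, Max      int
}

// parseDiscoverySpec parses a string of the form
// "clusterId:<id>,compartmentId:<id>,nodepoolTags:<k1>=<v1>&<k2>=<v2>,min:<n>,max:<m>".
func parseDiscoverySpec(spec string) (*discoverySpec, error) {
	out := &discoverySpec{Tags: map[string]string{}}
	for _, field := range strings.Split(spec, ",") {
		// Split on the first ':' only, so values may themselves contain ':'.
		key, value, ok := strings.Cut(field, ":")
		if !ok || value == "" {
			return nil, fmt.Errorf("malformed field %q", field)
		}
		var err error
		switch key {
		case "clusterId":
			out.ClusterID = value
		case "compartmentId":
			out.CompartmentID = value
		case "nodepoolTags":
			for _, pair := range strings.Split(value, "&") {
				k, v, ok := strings.Cut(pair, "=")
				if !ok || k == "" {
					return nil, fmt.Errorf("malformed tag %q", pair)
				}
				out.Tags[k] = v
			}
		case "min":
			out.Min, err = strconv.Atoi(value)
		case "max":
			out.Max, err = strconv.Atoi(value)
		default:
			return nil, fmt.Errorf("unknown field %q", key)
		}
		if err != nil {
			return nil, fmt.Errorf("invalid %s: %w", key, err)
		}
	}
	// Assumption: these three fields are mandatory.
	if out.ClusterID == "" || out.CompartmentID == "" || len(out.Tags) == 0 {
		return nil, fmt.Errorf("clusterId, compartmentId and nodepoolTags are required")
	}
	if out.Min > out.Max {
		return nil, fmt.Errorf("min %d exceeds max %d", out.Min, out.Max)
	}
	return out, nil
}

func main() {
	spec, err := parseDiscoverySpec(
		"clusterId:ocid1.cluster.oc1.c-1,compartmentId:ocid1.compartment.oc1..c1," +
			"nodepoolTags:foo=bar&nmsp.ca-managed=true,min:1,max:5")
	fmt.Println(spec, err)
}
```

Splitting each field on the first `:` only keeps the parser tolerant of values that contain further colons, and rejecting unknown keys makes typos in the flag fail at startup rather than being silently ignored.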

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 16, 2024
@k8s-ci-robot k8s-ci-robot added the area/provider/oci Issues or PRs related to oci provider label Oct 16, 2024
@k8s-ci-robot
Contributor

Hi @gvnc. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 16, 2024
@aleksandra-malinowska
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 16, 2024
| `debugging-snapshot-enabled` | Whether the debugging snapshot of cluster autoscaler feature is enabled. | false
| `node-delete-delay-after-taint` | How long to wait before deleting a node after tainting it. | 5 seconds
| `enable-provisioning-requests` | Whether the clusterautoscaler will be handling the ProvisioningRequest CRs. | false
| Parameter | Description | Default |
Contributor

@jlamillan jlamillan Oct 16, 2024

I don't think we want to edit/reformat this file, since it is outside the provider directory.

Contributor Author

@gvnc gvnc Oct 18, 2024

I reverted this file to its initial state. Instead, I updated the README under the oci folder.

@@ -153,8 +153,8 @@ func BuildOCI(opts config.AutoscalingOptions, do cloudprovider.NodeGroupDiscover
if err != nil {
klog.Fatalf("Failed to get pool type: %v", err)
}
if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) {
manager, err := nodepools.CreateNodePoolManager(opts.CloudConfig, do, createKubeClient(opts))
if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) || opts.NodeGroupAutoDiscovery != nil {
Contributor

We have two implementations of this provider, chosen based on whether ocid1.nodepool... or ocid1.instancepool... resources were specified via the --nodes param.

We want to give ourselves the option of supporting auto-discovery for both the nodepool and instancepool implementations, which means we need to be able to differentiate between the two in the --node-group-auto-discovery format.

Other cloud providers have added a label in the auto-discovery string to differentiate between different scaling group types, e.g. AWS => asg, GCE => mig, etc.

Maybe clusterId and nodepoolTags are already sufficient to clue us in that the implementation is OKE / nodepools. If that's the case, it's better to explicitly check for that here rather than assuming, e.g.:

if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) || hasNodeGroupAutoDiscovery() {

Later, instancepool could follow the pattern and do something like this:

else if strings.HasPrefix(ocidType, ipconsts.OciInstancePoolResourceIdent) || hasInstancePoolAutoDiscovery() {

Contributor Author

Thanks for highlighting this. I was unaware of the instancepool part and was focused on nodepools. I will come up with a fix for the instancepool case.

Contributor

To clarify, we don't have to actually implement auto-discovery for instance-pools in this PR. We just want to make sure to account for each implementation since hasNodeGroupAutoDiscovery() could be true with either.

Contributor Author

@gvnc gvnc Oct 18, 2024

I've added the suggested validation. Even though this change doesn't implement auto-discovery for instancepools at the moment, I assumed there would be a parameter called instancepoolTags.
The validation method would then check that either nodepoolTags or instancepoolTags is used in nodeGroupAutoDiscovery, but not both at the same time.

	_, nodepoolTagsFound, err := ocicommon.HasNodeGroupTags(opts.NodeGroupAutoDiscovery)
	if err != nil {
		klog.Fatalf("Failed to get auto discovery tags: %v", err)
	}
	if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) && nodepoolTagsFound {
		klog.Fatalf("-nodes and -node-group-auto-discovery parameters can not be used together.")
	} else if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) || nodepoolTagsFound {
		// return oci cloud provider
	}

Since the code already carries the comments below about instancepool, I didn't add a separate if statement for it.

	// theoretically the only other possible value is no value (if no node groups are passed in)
	// or instancepool, but either way, we'll just default to the instance pool implementation
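The selection rules discussed in this thread can be sketched as a single function. This is only an illustration under stated assumptions: `selectImplementation` and `hasField` are hypothetical helpers, the `instancepoolTags` key does not exist in this PR, and the string `"nodepool"` stands in for whatever value `npconsts.OciNodePoolResourceIdent` actually holds.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// nodePoolIdent is a stand-in for npconsts.OciNodePoolResourceIdent (assumed value).
const nodePoolIdent = "nodepool"

// hasField reports whether any auto-discovery spec contains a "<key>:" field.
func hasField(specs []string, key string) bool {
	for _, spec := range specs {
		for _, field := range strings.Split(spec, ",") {
			if strings.HasPrefix(field, key+":") {
				return true
			}
		}
	}
	return false
}

// selectImplementation returns "nodepool" or "instancepool".
// ocidType is the pool type derived from --nodes ("" if --nodes was not set);
// specs holds the raw --node-group-auto-discovery values.
func selectImplementation(ocidType string, specs []string) (string, error) {
	nodepoolTags := hasField(specs, "nodepoolTags")
	instancepoolTags := hasField(specs, "instancepoolTags") // hypothetical future key

	switch {
	case nodepoolTags && instancepoolTags:
		return "", errors.New("nodepoolTags and instancepoolTags cannot be used together")
	case ocidType != "" && (nodepoolTags || instancepoolTags):
		return "", errors.New("--nodes and --node-group-auto-discovery cannot be used together")
	case strings.HasPrefix(ocidType, nodePoolIdent) || nodepoolTags:
		return "nodepool", nil
	default:
		// No node groups at all, or instance pools: use the instance-pool implementation.
		return "instancepool", nil
	}
}

func main() {
	fmt.Println(selectImplementation("nodepool", nil))
	fmt.Println(selectImplementation("", []string{"clusterId:c,compartmentId:x,nodepoolTags:a=b,min:1,max:2"}))
	fmt.Println(selectImplementation("nodepool", []string{"nodepoolTags:a=b"}))
}
```

Returning an error instead of calling `klog.Fatalf` inside the helper keeps the decision testable; the caller can still turn the error into a fatal log at startup.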

return false, reqErr
}
for _, nodePoolSummary := range resp.Items {
klog.V(5).Infof("found nodepool %v", nodePoolSummary)
Contributor

@jlamillan jlamillan Oct 16, 2024

The NodePoolSummary contains semi-sensitive fields that we probably shouldn't log unless we have a reason to.

Also, it might be confusing to log found nodepool ... since at this point in the code we don't know whether it has the tags we require.

Contributor Author

I removed the log line.

}
for _, nodePoolSummary := range resp.Items {
klog.V(5).Infof("found nodepool %v", nodePoolSummary)
if validateNodepoolTags(nodeGroup.tags, nodePoolSummary.FreeformTags, nodePoolSummary.DefinedTags) {
Contributor

There are a few types of tags, including defined tags and free-form tags.

As I understand it, user-defined tags on a Node Pool resource would appear in the free-form tags (i.e. nodePoolSummary.FreeformTags, not nodePoolSummary.DefinedTags). Is there a reason we're not checking all the tag namespaces for a match?

Contributor Author

It is not only free-form tags. Users can also create their own namespaces and defined tags. We check both to make sure we don't miss a tag applied by the user.

A defined tag has a namespace, but a free-form tag does not.

  • Defined tag: namespace.tagKey=tagValue
  • Freeform tag: tagKey=tagValue

When we query a nodepool through the API, the response returns them in separate fields.

  • FreeformTags is a map[string]string
  • DefinedTags is a map[string]map[string]interface{} (namespace => tagKey => tagValue)
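To make the two lookups concrete, here is a minimal sketch of matching required tags against both fields. It is not the PR's `validateNodepoolTags`; the function `tagsMatch` and the rule "try the free-form tags first, then treat the key as `namespace.tagKey`" are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// tagsMatch reports whether every required tag is present on the node pool.
// A required key is first compared against the free-form tags. If no free-form
// tag matches, the key is split at the first '.' into namespace and tag key and
// compared against the defined tags.
func tagsMatch(required, freeform map[string]string, defined map[string]map[string]interface{}) bool {
	for key, want := range required {
		if got, ok := freeform[key]; ok && got == want {
			continue
		}
		namespace, tagKey, ok := strings.Cut(key, ".")
		if !ok {
			return false // no namespace part, so it cannot be a defined tag
		}
		// Indexing a missing namespace yields a nil map, which is safe to read.
		got, ok := defined[namespace][tagKey]
		if !ok || fmt.Sprint(got) != want {
			return false
		}
	}
	return true
}

func main() {
	required := map[string]string{"foo": "bar", "nmsp.ca-managed": "true"}

	matching := tagsMatch(required,
		map[string]string{"foo": "bar"},
		map[string]map[string]interface{}{"nmsp": {"ca-managed": "true"}})

	ignored := tagsMatch(required, map[string]string{"tag1": "1234"}, nil)

	fmt.Println(matching, ignored)
}
```

Defined tag values are compared through `fmt.Sprint` because they arrive as `interface{}`; a stricter implementation might require them to be strings.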

Contributor

I see.

manager.nodeGroups = append(manager.nodeGroups, *nodeGroup)
autoDiscoverNodeGroups(manager, manager.okeClient, *nodeGroup)
}

Contributor

@jlamillan jlamillan Oct 16, 2024

It seems like node-pools that were explicitly configured via --nodes should be added before (and take precedence over) node-pools that were discovered via --node-group-auto-discovery.

Do you agree? That also raises the question of the expected behavior of, say, the max or min node setting when a pool is specified via --nodes=2:5:ocid1.nodepool.oc1.np-a and also discovered via --node-group-auto-discovery=clusterId:ocid1.cluster.oc1.c-1,compartmentId:ocid1.compartment.oc1..c1,nodepoolTags:cluster-autoscaler-also/enabled=true,min:0,max:10?

Contributor Author

With this implementation, nodeGroupAutoDiscovery actually overrides the nodes parameter, which means the nodes parameter is ignored if nodeGroupAutoDiscovery is provided.

Here are the solutions I can think of:

  1. We could force the user to provide only one of them in the config. The CA would fail on startup if both parameters were provided, and we would log an error stating the reason so that the end user can fix the configuration by removing one of them.
  2. If we want both parameters to work together, we need to decide which one takes priority. I would say the nodes parameter should override the nodeGroupAutoDiscovery min/max values.

Please let me know your thoughts and I will make the changes accordingly.
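For comparison, option 2 could look roughly like the sketch below, in which an explicit --nodes entry wins over the limits that come from auto-discovery. The type `sizeLimits` and the function `mergeNodePools` are hypothetical names used only for illustration and are not part of this PR.

```go
package main

import "fmt"

// sizeLimits holds the min/max node count for one node pool.
type sizeLimits struct{ Min, Max int }

// mergeNodePools combines statically configured and auto-discovered node pools,
// keyed by node pool OCID. When a pool appears in both, the static limits win.
func mergeNodePools(static, discovered map[string]sizeLimits) map[string]sizeLimits {
	merged := make(map[string]sizeLimits, len(static)+len(discovered))
	for id, limits := range discovered {
		merged[id] = limits
	}
	for id, limits := range static {
		merged[id] = limits // explicit --nodes entry overrides discovered limits
	}
	return merged
}

func main() {
	static := map[string]sizeLimits{"ocid1.nodepool.oc1.np-a": {Min: 2, Max: 5}}
	discovered := map[string]sizeLimits{
		"ocid1.nodepool.oc1.np-a": {Min: 0, Max: 10},
		"ocid1.nodepool.oc1.np-b": {Min: 0, Max: 10},
	}
	fmt.Println(mergeNodePools(static, discovered))
}
```

Option 1 avoids this precedence question entirely by rejecting configurations that set both parameters.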

Contributor

The convention seems to be to warn against using both together in the docs [1,2], and/or to disallow it in the code [1].

I'm fine with either documenting it and/or erroring out. As you mentioned, the code currently overrides any static node-pools quietly while also logging messages as it processes each static node-pool, which could cause confusion.

Contributor Author

I added extra checks to prevent using both parameters together, and also documented it in oci/README.

Contributor

@jlamillan jlamillan left a comment

I raised a few issues that need to be resolved.

Additionally, an update to oci/README.md should also be a part of this change that documents the expected format of the discovery string clusterId:<clusterId>,compartmentId:<compartmentId>,nodepoolTags:<tagKey1>=<tagValue1>&<tagKey2>=<tagValue2>,min:<min>,max:<max>, and clarifies which types of tags are expected on the node pool (i.e. free-form or Oracle-Recommended-Tags, or OracleInternalReserved), and any other information that the user needs or that would be helpful to them.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gvnc
Once this PR has been reviewed and has the lgtm label, please assign jlamillan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jlamillan
Contributor

OK. Changes look good to me. How about you, @trungng92?

Labels
area/cluster-autoscaler area/provider/oci Issues or PRs related to oci provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.