NoDelayProvisionStrategy won't provision after scaling down to 0 instances in auto scaling group #425

cccCody · 2023-11-21T23:28:35Z

I think this is the same issue as #180

Describe the bug
I'm currently trying to move from ec2-plugin to this plugin, but I'm seeing that the final stage of my build doesn't ever get an executor. My build looks roughly like this:

build step on a single node
test in parallel on several nodes (150 of them!)
collect coverage reports on a single node

Everything works nicely until the last step, where it gets stuck on:

All nodes of label ec2-fleet are offline

When I check the system logs, I see this on repeat:

Nov 21, 2023 3:08:12 PM FINE com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [ec2-fleet]: queueLength 1 availableCapacity 1 (availableExecutors 0 plannedCapacitySnapshot 1 additionalPlannedCapacity 0)
Nov 21, 2023 3:08:12 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [ec2-fleet]: No excess workload, provisioning not needed.

I'm especially suspicious of plannedCapacitySnapshot 1, which, if I'm reading the source code right, seems to mean that it thinks it's already started scaling up another node (and is waiting for it to come online?) but it never does.
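
To make that reading of the log concrete, here is a minimal sketch of the arithmetic the log line appears to report. It assumes availableCapacity is the sum of the three parenthesized fields, which matches the numbers above. The class and method names are hypothetical and this is not the plugin's source code.

```java
// Simplified model of the decision implied by the log lines.
// All names here are hypothetical; this is not the plugin's implementation.
final class ProvisionDecision {
    private ProvisionDecision() {}

    /** Capacity the strategy believes it has: idle executors plus planned ones. */
    static int availableCapacity(int availableExecutors,
                                 int plannedCapacitySnapshot,
                                 int additionalPlannedCapacity) {
        return availableExecutors + plannedCapacitySnapshot + additionalPlannedCapacity;
    }

    /** Executors still needed after subtracting believed capacity; never negative. */
    static int excessWorkload(int queueLength, int availableCapacity) {
        return Math.max(0, queueLength - availableCapacity);
    }

    public static void main(String[] args) {
        // Values from the log: one planned node is counted even though it never comes online.
        int capacity = availableCapacity(0, 1, 0);
        System.out.println("availableCapacity=" + capacity
                + " excessWorkload=" + excessWorkload(1, capacity));

        // Same queue once the planned node is no longer counted.
        int cleared = availableCapacity(0, 0, 0);
        System.out.println("availableCapacity=" + cleared
                + " excessWorkload=" + excessWorkload(1, cleared));
    }
}
```

Under this model, a single planned node that never materializes is enough to hide a queue of one item, which would explain why "No excess workload" repeats indefinitely.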

Other misc info, may or may not be relevant:

All steps use the same label ("ec2-fleet") and run one executor per node.
cloud configuration includes:
- Minimum Cluster Size: 0
- Maximum Cluster Size: 2000
- Minimum Spare Size: 0
- Maximum Total Uses: 1

Environment Details

Plugin Version?
3.1.0 (latest as of opening this)

Jenkins Version?
2.426.1 (latest LTS version as of opening this issue)

Spot Fleet or ASG?
ASG

Label based fleet?
no

Linux or Windows?
linux

cccCody · 2023-11-21T23:35:48Z

I was able to get it to work for a single run by setting "Minimum Spare Size" to 1, but then when I started another build after that, it hit the issue when provisioning the first node that time. It seems like, more generally, this is an issue with scaling out shortly after scaling in.

icep87 · 2023-11-22T19:16:16Z

We are also seeing this issue. When no agents are available, meaning they have all been scaled down, the plugin won't spin up agents at all.

In the logs it says:

Nov 22, 2023 7:02:32 PM FINE com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy
label [linux]: queueLength 1 availableCapacity 1 (availableExecutors 0 plannedCapacitySnapshot 1 additionalPlannedCapacity 0)
Nov 22, 2023 7:02:32 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [linux]: No excess workload, provisioning not needed.
Nov 22, 2023 7:02:32 PM FINE com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy
label [powerful]: queueLength 1 availableCapacity 1 (availableExecutors 0 plannedCapacitySnapshot 1 additionalPlannedCapacity 0)
Nov 22, 2023 7:02:32 PM INFO com.amazon.jenkins.ec2fleet.NoDelayProvisionStrategy apply
label [powerful]: No excess workload, provisioning not needed.

I'm wondering why it actually says that there is availableCapacity when clearly there is none and no scale up is triggered.

icep87 · 2023-11-28T07:19:42Z

@cccCody Did you manage to find the cause of this?

taka-papa · 2023-12-25T01:57:16Z

I encountered a similar problem.
I tried the following in the Jenkins script console:

Jenkins jenkins = Jenkins.getInstance()

jenkins.getLabels().each { Label label ->
    def nodeProvisioner = label.nodeProvisioner
    def pendingLaunches = nodeProvisioner.getPendingLaunches()

    if (pendingLaunches.size() == 0) {
        return
    }

    println("Label: ${label.name}")
    pendingLaunches.each {
        println("  Planned Node: ${it.displayName}, Executors: ${it.numExecutors}")
    }
}

Output

Label: xxx
  Planned Node: NodeName-xx, Executors: 1
  Planned Node: NodeName-xx, Executors: 1

There were no jobs running.
Restarting Jenkins solved it.

opajonk · 2024-01-05T12:40:58Z

I think we are running into the same issue here, with the NoDelayProvisionStrategy. Digging around in the issues, I found #149 - this one reads like a regression. Could that be?

Restarting Jenkins "fixed" the issue, but I suspect it will come back. Then I will run the script console snippet of @snowman-papa to see if we also have "stuck planned" machines.

pawel-t · 2024-02-08T11:07:49Z

I have faced the same issue on 3.2.0.

It appeared once we switched to ASG from SpotFleet. My SpotFleet was set up with Min = 0 and Spare = 0, and after it scaled down to 0 instances it didn't scale up for 1.5h while jobs were waiting in the queue.

I needed to increase min and spare for it to work.

ldmonkey · 2024-04-19T10:29:29Z

We downgraded the plugin from 3.2.0 to 3.0.1. Waiting for a fix for this issue.

icep87 · 2024-06-28T07:48:00Z

The issue is actually occurring in the NodeProvisioner. We could not find any failed-launch errors in the EC2-Fleet plugin. Instead, the NodeProvisioner has stale nodes planned. The plugin could fix this by monitoring which nodes are planned by the NodeProvisioner and removing them if they haven't come up after a defined time. Currently, NodeProvisioner does not provide any kind of timestamp indicating when the planned node was requested, which makes stale nodes quite hard to find without a function that monitors them.
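
As a rough illustration of that monitoring idea, the sketch below records when each planned node was first observed and reports the ones that have stayed planned longer than a timeout. It is a standalone model with hypothetical names (PlannedNodeWatchdog, observe), not plugin or Jenkins core code. It keys nodes by display name, which is an assumption: display names may not be unique, so a real implementation would probably key on the planned node object itself.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of the plugin or of Jenkins core.
final class PlannedNodeWatchdog {
    private final Duration timeout;
    // First time each planned node name was observed.
    private final Map<String, Instant> firstSeen = new HashMap<>();

    PlannedNodeWatchdog(Duration timeout) {
        this.timeout = timeout;
    }

    /**
     * Records the currently planned node names and returns those that have
     * been planned for longer than the timeout. Names that are no longer
     * planned are forgotten.
     */
    List<String> observe(Collection<String> plannedNames, Instant now) {
        firstSeen.keySet().retainAll(plannedNames);
        List<String> stale = new ArrayList<>();
        for (String name : plannedNames) {
            Instant seen = firstSeen.computeIfAbsent(name, n -> now);
            if (Duration.between(seen, now).compareTo(timeout) > 0) {
                stale.add(name);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        PlannedNodeWatchdog watchdog = new PlannedNodeWatchdog(Duration.ofMinutes(10));
        Instant start = Instant.parse("2024-06-28T07:00:00Z");
        // First observation: nothing can be stale yet.
        System.out.println(watchdog.observe(List.of("node-a"), start));
        // Eleven minutes later node-a is still planned, node-b is new.
        System.out.println(watchdog.observe(List.of("node-a", "node-b"),
                start.plus(Duration.ofMinutes(11))));
    }
}
```

Run periodically, the returned names would be the candidates for cancellation. The timeout value is arbitrary here and would need to be longer than a normal instance boot.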

As mentioned by @taka-papa, you can run the script to see if there are any planned launches of nodes.
If you have stale nodes and want to solve the issue without restarting Jenkins, here is a script that will help you remove the planned launches.

import jenkins.model.Jenkins
import hudson.model.Label
import hudson.slaves.NodeProvisioner

Jenkins jenkins = Jenkins.getInstance()

def labelToReset = "LABEL OF THE CLOUDNODES"

jenkins.getLabels().each { Label label ->
    if (label.name == labelToReset) {
        def nodeProvisioner = label.nodeProvisioner
        def pendingLaunches = nodeProvisioner.getPendingLaunches()

        if (pendingLaunches.size() > 0) {
            println("Cancelling pending launches for label: ${label.name}")
            pendingLaunches.each { launch ->
                launch.future.cancel(true)
                println("Cancelled launch: ${launch.displayName}")
            }
        } else {
            println("No pending launches to cancel for label: ${label.name}")
        }
    }
}

return "Cancellation process completed."

wosiu · 2024-06-28T08:50:07Z

@icep87 What do you mean by "(...) has stale nodes planned"?
Any chance you could dig into why they don't "come up after a defined time"?

icep87 · 2024-07-02T07:07:28Z

@wosiu What I mean is that the nodes get stuck in PendingLaunch. They never leave that state. I could not find any errors in the logs, and as this issue is hard to reproduce it makes it difficult to debug. But as soon as I have the chance I will try to debug it.

For example we haven't had this issue for more than 2 weeks now.

PW999 · 2024-08-28T09:35:55Z

I'm constantly having issues with the NoDelayProvisionStrategy.

Right now it says the target is 3

Yet in AWS the desired size is 1

and there is a single instance running, but it doesn't show up in Jenkins at all.
So if you need any specific debug logs, just let me know ;)

ebarped · 2024-10-03T09:37:11Z

> We downgraded the plugin from 3.2.0 to 3.0.1. Waiting for a fix for this issue.

We are waiting for this issue to be fixed to start using this plugin. @ldmonkey does the version 3.0.1 work properly?

dsakilesh · 2024-10-18T05:56:03Z

We also face the same issue. We downgraded to 3.0.1 with no luck. Was anyone able to solve the issue? Or any workaround for this issue?

wosiu · 2024-10-18T07:30:26Z

Unfortunately no. We tried a few hacks with no consistent luck. Eventually we paused the adoption of the plugin :(


cccCody added the bug label Nov 21, 2023

cccCody mentioned this issue Nov 21, 2023

NoDelayProvisionStrategy won't provision when starting with 0 instances in auto-scaling fleet #180

Closed

pawel-t mentioned this issue Feb 8, 2024

Instance are not shutting down due to "Protection from scale In" #432

Open
