NoDelayProvisionStrategy won't provision after scaling down to 0 instances in auto scaling group #425
Comments
I was able to get it to work for a single run by setting "Minimum Spare Size" to 1, but then when I started another build after that, it hit the issue when provisioning the first node that time. It seems like, more generally, this is an issue with scaling out shortly after scaling in. |
We are also seeing this issue when there are no agents available, meaning they have all been scaled down. The plugin won't provision new ones. In the logs it says:
I'm wondering why it actually says that there is availableCapacity when clearly there is none and no scale up is triggered. |
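One possible explanation, sketched below, is that the strategy counts capacity that is planned but not yet online as available. This is a simplified model rather than the plugin's actual code: the function name and parameters are hypothetical, and the assumption that planned capacity is added to available capacity follows the reporter's reading of plannedCapacitySnapshot in the issue description.

```python
def excess_workload(queue_length, available_executors, connecting_executors, planned_capacity):
    """Return how many executors still need to be provisioned (never negative)."""
    # Assumption: executors that are only planned (not yet online) count as available.
    available_capacity = available_executors + connecting_executors + planned_capacity
    return max(0, queue_length - available_capacity)


# Healthy case: one queued job, nothing online or planned, so one executor is requested.
print(excess_workload(queue_length=1, available_executors=0,
                      connecting_executors=0, planned_capacity=0))  # prints 1

# Stale case: a planned node that never comes online still covers the demand,
# so nothing is requested even though no agent exists.
print(excess_workload(queue_length=1, available_executors=0,
                      connecting_executors=0, planned_capacity=1))  # prints 0
```

Under this model, a single stale planned node is enough to report non-zero available capacity and suppress a scale-up from 0 instances.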
@cccCody Did you manage to find the cause of this? |
I encountered a similar problem
Output
There were no jobs running |
I think we are running into the same issue here, with the NoDelayProvisionStrategy. Digging around in the issues I found #149, and this one reads like a regression. Could that be? Restarting Jenkins "fixed" the issue, but I suspect it will come back. When it does, I will run the script console snippet of @snowman-papa to see if we also have "stuck planned" machines. |
I have faced the same issue on 3.2.0, once we switched to ASG from SpotFleet. My SpotFleet was set up with Min = 0 and Spare = 0, and after it scaled down to 0 instances it didn't scale up for 1.5 h while jobs were waiting in the queue. I needed to increase min and spare in order for it to work. |
We downgraded the plugin from 3.2.0 to 3.0.1. Waiting for a fix for this issue. |
The issue is actually occurring in the NodeProvisioner. We could not find any failed-launch errors in the EC2-Fleet plugin logs. Instead, the NodeProvisioner has stale nodes planned. This could be fixed in the plugin by monitoring which nodes are planned by the NodeProvisioner and removing them if they haven't come up after a defined time. Currently, NodeProvisioner does not provide any kind of timestamp indicating when a planned node was requested, which makes stale nodes quite hard to find without a function that monitors them. As mentioned by @taka-papa, you can run the script to see if there are any planned launches of nodes.
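The monitoring idea described above could look roughly like the following sketch. It is a standalone model, not Jenkins or EC2-Fleet plugin code: the class PlannedNodeTracker, its method names, and the timeout value are all hypothetical, and it assumes the plugin itself records a timestamp when a node is planned, since NodeProvisioner does not expose one.

```python
import time


class PlannedNodeTracker:
    """Remembers when each planned node was requested and expires stale ones."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be exercised without waiting
        self._requested_at = {}

    def record_planned(self, node_name):
        # Keep the first timestamp if the same node is reported more than once.
        self._requested_at.setdefault(node_name, self.clock())

    def mark_online(self, node_name):
        # A node that came up is no longer planned.
        self._requested_at.pop(node_name, None)

    def expire_stale(self):
        """Drop and return the planned nodes that exceeded the timeout."""
        now = self.clock()
        stale = [name for name, requested in self._requested_at.items()
                 if now - requested > self.timeout_seconds]
        for name in stale:
            del self._requested_at[name]
        return stale

    def planned_count(self):
        return len(self._requested_at)


# Usage with a fake clock: "node-a" is planned at t=0 and never comes online.
fake_now = [0.0]
tracker = PlannedNodeTracker(timeout_seconds=600, clock=lambda: fake_now[0])
tracker.record_planned("node-a")
fake_now[0] = 700.0
print(tracker.expire_stale())   # prints ['node-a']
print(tracker.planned_count())  # prints 0
```

Once a stale entry is dropped, the planned capacity returns to 0 and a new scale-up can be requested. How to cancel the corresponding planned node inside NodeProvisioner itself is not covered by this sketch.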
|
@icep87 What do you mean by "(...) has stale nodes planned"? |
@wosiu What I mean is that the nodes get stuck in PendingLaunch. They never leave that state. I could not find any errors in the logs, and as this issue is hard to reproduce it makes it difficult to debug. But as soon as I have the chance I will try to debug it. For example we haven't had this issue for more than 2 weeks now. |
We are waiting for this issue to be fixed to start using this plugin. @ldmonkey does the version |
We also face the same issue. We downgraded to 3.0.1 with no luck. Was anyone able to solve the issue? Or any workaround for this issue? |
Unfortunately no. We tried a few hacks with no consistent luck. Eventually we paused the adoption of the plugin :(
|
I think this is the same issue as #180
Describe the bug
I'm currently trying to move from ec2-plugin to this plugin, but I'm seeing that the final stage of my build doesn't ever get an executor. My build looks roughly like this:
Everything works nicely until the last step, where it gets stuck on:
When I check the system logs, I see this on repeat:
I'm especially suspicious of plannedCapacitySnapshot 1, which, if I'm reading the source code right, seems to mean that it thinks it's already started scaling up another node (and is waiting for it to come online?) but it never does.
Other misc info, may or may not be relevant:
Environment Details
Plugin Version?
3.1.0 (latest as of opening this)
Jenkins Version?
2.426.1 (latest LTS version as of opening this issue)
Spot Fleet or ASG?
ASG
Label based fleet?
no
Linux or Windows?
linux