awx execution node Another cluster node has determined this instance to be unresponsive #934
Comments
Hey @AlanCoding, did I see you referring to something similar last week?
Hi @fosterseth, I'm facing the same problem with AWX 23.7.0.
Control node: […]
Control node UI: […]
Worker node: […]
But once I restart the control node, all my execution instances become available again. Instances usually become unavailable when several jobs are executed simultaneously. Thanks for your support!
Hi @birb57 and @henriquecfg, I encountered the same error a few weeks ago, as mentioned in issue #890. I'm currently running AWX version 23.6.0 and faced exactly the same error, with the same workaround (restarting the awx-task pod) proving effective. Can you please check that you are using a fixed version of the AWX Execution Environment (AWX-EE) image? I managed to resolve this issue by pinning the version in my deployment: instead of using quay.io/ansible/awx-ee:latest, I switched to quay.io/ansible/awx-ee:23.6.0. While this may not address the root cause of the problem, it's a more stable workaround than restarting the controller. I've looked through recent commits to identify potential causes but haven't pinpointed the issue yet. Hope this helps. Have a nice day!
Thanks @LalosBastien. It is working with quay.io/ansible/awx-ee:23.6.0; with quay.io/ansible/awx-ee:23.7.0 the error persists.
Hi
How can I restart the awx-task pod alone?
Thanks
Hi @birb57, to restart the awx-task deployment alone:

kubectl rollout restart deployment awx-task -n <awx_namespace>

@TheRealHaoLiu, could you please tell us where the […]? Additionally, is it possible to use the […]?
Could anyone here who is facing this issue share your […]?
https://github.com/ansible/awx-ee is the exact repository for […]
No, […]
It's worth noting on this issue that when attempting to reload the receptor, the reload just hangs on any container that ends up being disconnected. Restarting the container as a whole does resolve the issue, though.
We get a similar issue. All we do is follow the workaround of restarting the task pods; we're not sure what the reason for this is.
We're seeing this after upgrading to the latest Ansible Automation Platform, which also included receptor 1.4.4.
I have been able to fully replicate the issue and can easily reproduce what occurs. I am in the process of doing a full write-up with all findings. (I am a consumer of the product and do not represent or have any affiliation with the Ansible GitHub org or any products under the IBM/Red Hat umbrella.)

The short version is CPU contention on the node running as a receptor execution node, plus something that changed in receptor between builds […]. This is due to what appears to be a flawed calculation of available resources on ALL receptor nodes, but more specifically the receptor execution node. AWX/Receptor will throw more jobs at the execution node than it can handle, a process in the awx-task container throws an exception in a thread, and it must be manually restarted to resolve the issue. I upgraded my lab environment to v24.0.0 of AWX (using all default values) and the issue persists. The above workaround resolves the issue in v24.0.0.

The better workaround I would recommend is to dial back the maximum number of forks a receptor execution instance can run (typically the lowest value is still too high) and/or ensure the receptor execution host is running on a CPU with high clock speeds. From my testing […]
I am almost certain that the offending commit is this line, and it's not allowing the remote receptor enough time to respond.
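For anyone skimming the rest of the thread: the "one line" in question comes down to the general choice between wrapping a remote operation in a context with a hard deadline versus a plain cancellable context. Below is a minimal, self-contained Go sketch of that difference only; it is not the receptor source, and slowRemoteCall plus the 1s/3s durations are made-up stand-ins for whatever the real connection logic does.

// ctx_sketch.go: illustrates context.WithTimeout vs context.WithCancel,
// the pattern discussed in this thread. Not receptor code.
package main

import (
	"context"
	"fmt"
	"time"
)

// slowRemoteCall stands in for a remote receptor operation that happens
// to answer slower than a hard-coded deadline (e.g. under CPU contention).
func slowRemoteCall(ctx context.Context) error {
	select {
	case <-time.After(3 * time.Second): // pretend the peer answers after 3s
		return nil
	case <-ctx.Done():
		return ctx.Err() // e.g. context.DeadlineExceeded
	}
}

func main() {
	// Variant A: hard deadline. A slow but otherwise healthy peer gets
	// cut off with "context deadline exceeded".
	ctxA, cancelA := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancelA()
	fmt.Println("with timeout:", slowRemoteCall(ctxA))

	// Variant B: cancellable only. The call may take as long as it needs
	// unless the caller explicitly cancels it.
	ctxB, cancelB := context.WithCancel(context.Background())
	defer cancelB()
	fmt.Println("with cancel:", slowRemoteCall(ctxB))
}

Which of the two is appropriate here, and what deadline (15 vs 90 seconds) if a timeout is kept, is exactly what the later comments in this thread go on to test.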
I will try to build the latest image with that one line rolled back to cancel.
Sooooo I think the issue is fixed!!!! I did not realize before my last two comments that v1.4.5 had been released. When I went to make the change and re-compile the receptor binary, I found it had already been reverted back to cancel. I used the following Dockerfile to build. I'm throwing a LOT at my dev instance on generic compute to replicate the exact issue, and it appears to remain happy.

FROM ghcr.io/ansible/receptor:v1.4.5 as builder
FROM quay.io/ansible/awx-ee:24.0.0
COPY --from=builder /usr/bin/receptor /usr/bin/receptor

The image is published to […]. This commit appears to be the fix: c19fdcc
@whitej6 we'd still be interested in your test with the newest version of quic, but just changing that one line to use a context with cancel instead of a context with timeout. Because as it stands, we reverted both the quic version and changed that one line, so we aren't sure which change actually helped. Ideally we want to get on the latest quic version. If you wanted to proceed by building receptor with that change and testing, that would be valuable information for us!
Go isn't a language I'm super strong in, and this may not be something I could do quickly enough. I believe I should be able to grab v1.4.4, make the change back to cancel, and build, but if you by chance have a build of the binary I am happy to use that binary in my test environment. Also finalizing the last pieces of my findings doc and waiting on approval to post. Sorry for the delay.
quay.io/fosterseth/awx-ee:ctxwithcancel. That awx-ee has the change above (^). If you just want the binary, you can copy it out of that image and then copy it into your own awx-ee.
That should be enough. It just needs to be on the control plane.
@fosterseth it appears that the image provided still has the issue. My RHEL host is not behaving (completely locked up), so I may need to redeploy that host to be sure, though. I do think, if I interpreted the code correctly, that a 15-second timeout is likely too aggressive. It would help if you could push a few different flavors of the image (e.g. invert the change, revert quic but with a 90-second timeout, etc.).
Here is the info for that test: quay.io/fosterseth/awx-ee:ctxwithcancel
Rerunning the test right now with the same image. I am monitoring CPU/MEM while running on the remote receptor execution node. So far the second test has been happy. I did not go in to check the state of the remote receptor execution node before the first test; there may have been some lingering items that weren't cleaned up properly. I should have results from the second test in about 5 minutes. If the second test works fine, I will double the number of jobs I am throwing at it to see if I can break it, TCP-windowing style 😅
Okay, good to know. Here is another build that HAS the timeout, but set to 90 seconds: quay.io/fosterseth/awx-ee:ctxwithtimeout90 (v1.4.4 with this diff).
While I wait on the test to finish: one thing that appears to occur at the same time is batches of jobs finishing and new pods starting up. Meaning, I throw 500 jobs at the remote receptor (max forks set to 78; I can dial back to 8, but for testing I'm leaving it at 78), and AWX decides, based off of some metric, how many jobs it can handle at once. When I have a handful finishing at once, at times it appears it's not doing a 1-in-1-out queuing of jobs; sometimes I see 1 finish and 2 start, and that seems to be what finally pushes it over the edge. For my current test environment it looks like 37 concurrent jobs (of my test playbook) is the max where it stays happy. It's when it goes over the magical 37 that I am able to replicate the issue in my test environment. Also, it appears that in mixed-node-size Kubernetes environments a compounding factor is that all […]

While writing this up, my second test with the first image provided finished and did not have any issues. I will make sure the host is clean and will try the 90-second timeout. If neither reproduces, I will swap back to […]
The job capacity (how many jobs it can run at once) is a little arbitrary -- it is based on the total RAM / CPU of that remote node. The system should be able to tolerate a handful more jobs over 37 without issue. It could be that the margins are close on your machine, and more jobs past 37 are causing things to tap out, leading to the reproduced issue.
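To make the "based on total RAM / CPU" remark a little more concrete, here is a small invented Go sketch of that kind of capacity heuristic. It is not AWX's actual algorithm (which lives in its Python source); the constants, the blending knob, and the function are illustrative assumptions only.

// capacity_sketch.go: a rough, invented illustration of a RAM/CPU based
// job-capacity estimate like the one described above. NOT the AWX formula;
// all constants are arbitrary example values.
package main

import "fmt"

const (
	forksPerCPU   = 4    // assumed forks one CPU core can handle
	memPerForkMB  = 100  // assumed memory needed per fork
	reservedMemMB = 2048 // assumed headroom kept for the system
)

// estimateCapacity blends a CPU-derived and a memory-derived ceiling.
// adjustment ranges from 0.0 (conservative, closer to the smaller value)
// to 1.0 (optimistic, closer to the larger value), mimicking a
// capacity-adjustment style knob.
func estimateCapacity(cpus, memMB int, adjustment float64) int {
	cpuCap := cpus * forksPerCPU
	memCap := (memMB - reservedMemMB) / memPerForkMB
	lo, hi := cpuCap, memCap
	if lo > hi {
		lo, hi = hi, lo
	}
	return lo + int(adjustment*float64(hi-lo))
}

func main() {
	// A 4-vCPU / 8 GiB node lands in the high thirties with these example
	// numbers, yet nothing in the estimate reflects clock speed.
	fmt.Println(estimateCapacity(4, 8192, 0.5)) // prints 38
}

The point of the sketch is only that an estimate like this is static, so how much margin exists past it depends entirely on how fast the node actually is.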
37 is just the magical number for that specific machine that gets it to the borderline of almost falling over itself. If I throw a different CPU at it with better clock/turbo, I can have substantially higher concurrent jobs. The number of concurrent jobs isn't what is interesting; it's that if, let's say, 5 complete at once but 7 or 8 take their place, that will push it over the edge. It might be that enough available capacity is there for X seconds, so it oversubscribes enough to be problematic. Starting the timeout test now. Had back-to-back calls.
Part of this goes back to how well the Python portion runs behind the scenes and how much it's impacted by the CPU clock speed itself. The Intel CPU mentioned above has shown to be 4x more performant on Python <3.11 in my testing of arbitrary Django code (my day job); I haven't compared performance deltas on 3.11 yet but wouldn't be surprised to see a 5x bump. The less performant CPU clock speed is, from my testing, a contributing factor, and I have my test environment on under-provisioned generic compute specifically to make it "easier to break" so I can replicate the issue.
The 90-second timeout image appears good as well. Going to redo the cancel image test, but it looks like both are viable fixes.
@whitej6 excellent, thanks for reporting back your findings!
@fosterseth I ran through the test again and had the same results. Cancel or the 90-second timeout appear to work without problem. I ran again with the 24.0.0 tag just to verify it does break, and it did. If you would like me to run any additional chaos testing, let me know which image tag to test with and I am happy to do so.
@fosterseth heads up, I saw AWX v24.1.0 dropped yesterday using receptor v1.4.5; good news, it is good :-)
@fosterseth Out of curiosity, I see that parts of AWX are now using py3.11 but the EE image is still on 3.9. From my testing, on some operations (especially if iteration is heavily used) py3.11 can be up to 20% more performant.
@whitej6 thanks for pointing that out, @TheRealHaoLiu is working on bumping up to 3.11
Thanks for all the hard work you did @whitej6. Closing this issue since it's resolved now. Come hang out with us more often 😉
Hi
From time to time my execution node instance switches from Ready to Unavailable (with 100% used capacity).
A workaround is a simple reboot of the AWX control node, together with a restart of the receptor service on the execution node and deleting and re-adding the instance (execution node) in the AWX web interface.
This has occurred 3 times in the last 2 weeks.
We can see the errors below on the instance (from the AWX UI):
First message:
awx execution node Another cluster node has determined this instance to be unresponsive
Second and last:
Receptor error from vmgobemouche.cedelgroup.com, detail:
Work unit expired on Fri Jan 26 08:55:29
Find below the latest receptor.log messages:
ERROR 2024/01/26 09:31:17 unknown work unit sp4SpHkv
ERROR 2024/01/26 09:32:22 Error locating unit: uZoGIZvT
ERROR 2024/01/26 09:32:22 unknown work unit uZoGIZvT
ERROR 2024/01/26 09:36:36 Error locating unit: JxKKzyD1
ERROR 2024/01/26 09:36:36 unknown work unit JxKKzyD1
WARNING 2024/01/26 09:39:42 Timing out connection, idle for the past 21s
INFO 2024/01/26 09:39:42 Known Connections:
INFO 2024/01/26 09:39:42 vmgobemouche.cedelgroup.com: awx-task-79b8fff55b-vj8rr(1.00)
INFO 2024/01/26 09:39:42 awx-task-76946f89bc-ssl2l:
INFO 2024/01/26 09:39:42 awx-task-79b8fff55b-vj8rr: vmgobemouche.cedelgroup.com(1.00)
INFO 2024/01/26 09:39:42 Routing Table:
INFO 2024/01/26 09:39:42 awx-task-79b8fff55b-vj8rr via awx-task-79b8fff55b-vj8rr
WARNING 2024/01/26 09:39:44 Could not read in control service: timeout: no recent network activity
ERROR 2024/01/26 09:39:47 INTERNAL_ERROR: no route to node
ERROR 2024/01/26 09:39:47 Write error in control service: INTERNAL_ERROR: no route to node
INFO 2024/01/26 09:52:28 Running control service control
INFO 2024/01/26 09:52:28 Initialization complete
INFO 2024/01/26 09:52:33 Connection established with awx-task-79b8fff55b-vj8rr
INFO 2024/01/26 09:52:33 Known Connections:
INFO 2024/01/26 09:52:33 vmgobemouche.cedelgroup.com: awx-task-79b8fff55b-vj8rr(1.00)
INFO 2024/01/26 09:52:33 awx-task-79b8fff55b-vj8rr: vmgobemouche.cedelgroup.com(1.00)
INFO 2024/01/26 09:52:33 Routing Table:
INFO 2024/01/26 09:52:33 awx-task-79b8fff55b-vj8rr via awx-task-79b8fff55b-vj8rr
INFO 2024/01/26 10:03:56 Connection established with awx-task-76946f89bc-ssl2l
INFO 2024/01/26 10:03:56 Known Connections:
INFO 2024/01/26 10:03:56 vmgobemouche.cedelgroup.com: awx-task-79b8fff55b-vj8rr(1.00) awx-task-76946f89bc-ssl2l(1.00)
INFO 2024/01/26 10:03:56 awx-task-79b8fff55b-vj8rr: vmgobemouche.cedelgroup.com(1.00)
INFO 2024/01/26 10:03:56 awx-task-76946f89bc-ssl2l: vmgobemouche.cedelgroup.com(1.00)
INFO 2024/01/26 10:03:56 Routing Table:
INFO 2024/01/26 10:03:56 awx-task-76946f89bc-ssl2l via awx-task-76946f89bc-ssl2l
INFO 2024/01/26 10:03:56 awx-task-79b8fff55b-vj8rr via awx-task-79b8fff55b-vj8rr
ERROR 2024/01/26 10:05:02 Error locating unit: HZ258Y8C
ERROR 2024/01/26 10:05:02 unknown work unit HZ258Y8C
ERROR 2024/01/26 10:05:32 Error locating unit: 6mqBWE3J
ERROR 2024/01/26 10:05:32 unknown work unit 6mqBWE3J
ERROR 2024/01/26 10:06:14 Error locating unit: mpK7PYaF
ERROR 2024/01/26 10:06:14 unknown work unit mpK7PYaF
awx: 0.23.5 on k3s v1.28.4+k3s2 and ansible-core 2.12.5
Thanks for your support