
awx execution node Another cluster node has determined this instance to be unresponsive #934

Closed
birb57 opened this issue Jan 26, 2024 · 36 comments

Comments

@birb57

birb57 commented Jan 26, 2024

Hi

From time to time, my execution node instance switches from ready to unavailable (with 100% used capacity).

A workaround is a simple reboot of the AWX control node, combined with a restart of the receptor service on the execution node and deleting and re-adding the instance (execution node) in the AWX web interface.
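For reference, the execution-node side of that recovery looks roughly like this (a sketch assuming receptor was installed from the AWX install bundle as a systemd unit named receptor and uses the default socket path):

# on the execution node
sudo systemctl restart receptor
# confirm the node rejoined the mesh
sudo receptorctl --socket /var/run/receptor/receptor.sock status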

This has occurred 3 times in the past 2 weeks.

The errors below appear on the instance (in the AWX UI):

First message:
awx execution node Another cluster node has determined this instance to be unresponsive

Second and last:
Receptor error from vmgobemouche.cedelgroup.com, detail:
Work unit expired on Fri Jan 26 08:55:29

Find below the latest receptor.log messages:
ERROR 2024/01/26 09:31:17 unknown work unit sp4SpHkv
ERROR 2024/01/26 09:32:22 Error locating unit: uZoGIZvT
ERROR 2024/01/26 09:32:22 unknown work unit uZoGIZvT
ERROR 2024/01/26 09:36:36 Error locating unit: JxKKzyD1
ERROR 2024/01/26 09:36:36 unknown work unit JxKKzyD1
WARNING 2024/01/26 09:39:42 Timing out connection, idle for the past 21s
INFO 2024/01/26 09:39:42 Known Connections:
INFO 2024/01/26 09:39:42 vmgobemouche.cedelgroup.com: awx-task-79b8fff55b-vj8rr(1.00)
INFO 2024/01/26 09:39:42 awx-task-76946f89bc-ssl2l:
INFO 2024/01/26 09:39:42 awx-task-79b8fff55b-vj8rr: vmgobemouche.cedelgroup.com(1.00)
INFO 2024/01/26 09:39:42 Routing Table:
INFO 2024/01/26 09:39:42 awx-task-79b8fff55b-vj8rr via awx-task-79b8fff55b-vj8rr
WARNING 2024/01/26 09:39:44 Could not read in control service: timeout: no recent network activity
ERROR 2024/01/26 09:39:47 INTERNAL_ERROR: no route to node
ERROR 2024/01/26 09:39:47 Write error in control service: INTERNAL_ERROR: no route to node
INFO 2024/01/26 09:52:28 Running control service control
INFO 2024/01/26 09:52:28 Initialization complete
INFO 2024/01/26 09:52:33 Connection established with awx-task-79b8fff55b-vj8rr
INFO 2024/01/26 09:52:33 Known Connections:
INFO 2024/01/26 09:52:33 vmgobemouche.cedelgroup.com: awx-task-79b8fff55b-vj8rr(1.00)
INFO 2024/01/26 09:52:33 awx-task-79b8fff55b-vj8rr: vmgobemouche.cedelgroup.com(1.00)
INFO 2024/01/26 09:52:33 Routing Table:
INFO 2024/01/26 09:52:33 awx-task-79b8fff55b-vj8rr via awx-task-79b8fff55b-vj8rr
INFO 2024/01/26 10:03:56 Connection established with awx-task-76946f89bc-ssl2l
INFO 2024/01/26 10:03:56 Known Connections:
INFO 2024/01/26 10:03:56 vmgobemouche.cedelgroup.com: awx-task-79b8fff55b-vj8rr(1.00) awx-task-76946f89bc-ssl2l(1.00)
INFO 2024/01/26 10:03:56 awx-task-79b8fff55b-vj8rr: vmgobemouche.cedelgroup.com(1.00)
INFO 2024/01/26 10:03:56 awx-task-76946f89bc-ssl2l: vmgobemouche.cedelgroup.com(1.00)
INFO 2024/01/26 10:03:56 Routing Table:
INFO 2024/01/26 10:03:56 awx-task-76946f89bc-ssl2l via awx-task-76946f89bc-ssl2l
INFO 2024/01/26 10:03:56 awx-task-79b8fff55b-vj8rr via awx-task-79b8fff55b-vj8rr
ERROR 2024/01/26 10:05:02 Error locating unit: HZ258Y8C
ERROR 2024/01/26 10:05:02 unknown work unit HZ258Y8C
ERROR 2024/01/26 10:05:32 Error locating unit: 6mqBWE3J
ERROR 2024/01/26 10:05:32 unknown work unit 6mqBWE3J
ERROR 2024/01/26 10:06:14 Error locating unit: mpK7PYaF
ERROR 2024/01/26 10:06:14 unknown work unit mpK7PYaF

awx : 0.23.5 on k3s v1.28.4+k3s2 and ansible-core 2.12.5

Thanks for your support

@AaronH88
Contributor

Hey @AlanCoding, did I see you referring to something similar last week?

cc @TheRealHaoLiu & @fosterseth

@fosterseth
Member

"awx : 0.23.5" -- do you mean 23.5.0?

@henriquecfg

Hi @fosterseth, I'm facing the same problem with awx 23.7.0

Control node:

> receptorctl --socket /var/run/receptor/receptor.sock ping X.X.X.X

ERROR: no route to node
ERROR: no route to node

Control Node UI:
Another cluster node has determined this instance to be unresponsive

Worker Node:

> cat /var/log/receptor/receptor.log

INFO 2024/01/20 16:08:35 Routing Table:
WARNING 2024/01/20 16:08:36 Could not read in control service: INTERNAL_ERROR: no route to node
ERROR 2024/01/20 16:08:37 INTERNAL_ERROR: no route to node
ERROR 2024/01/20 16:08:37 Write error in control service: INTERNAL_ERROR: no route to node
ERROR 2024/01/20 16:08:41 timeout: no recent network activity
ERROR 2024/01/20 16:08:41 Write error in control service: timeout: no recent network activity
ERROR 2024/01/20 16:08:43 INTERNAL_ERROR: no route to node
ERROR 2024/01/20 16:08:43 Write error in control service: INTERNAL_ERROR: no route to node
ERROR 2024/01/20 16:08:44 timeout: no recent network activity
ERROR 2024/01/20 16:08:44 Write error in control service: timeout: no recent network activity
WARNING 2024/01/20 16:08:46 Could not read in control service: timeout: no recent network activity
WARNING 2024/01/20 16:08:47 Could not read in control service: timeout: no recent network activity
WARNING 2024/01/20 16:08:48 Could not read in control service: timeout: no recent network activity
WARNING 2024/01/20 16:08:49 Could not read in control service: timeout: no recent network activity
WARNING 2024/01/20 16:08:51 Could not read in control service: timeout: no recent network activity
ERROR 2024/01/20 16:09:00 timeout: no recent network activity
ERROR 2024/01/20 16:09:00 Write error in control service: timeout: no recent network activity

But once I restart the control node, all my execution instances become available.

Instances usually become unavailable when several jobs are executed simultaneously.

Thank you for your support!

@LalosBastien

Hi @birb57 and @henriquecfg,

I encountered the same error a few weeks ago as mentioned in issue #890.

I'm currently running AWX version 23.6.0 and faced exactly the same error, with the same workaround (restarting the awx-task pod) proving effective.

Can you please check that you are using a pinned version of the AWX Execution Environment (AWX-EE) image? I managed to resolve this issue by pinning the version in my deployment. Instead of using quay.io/ansible/awx-ee:latest, I switched to quay.io/ansible/awx-ee:23.6.0.

While this may not address the root cause of the problem, it's a more stable workaround than restarting the controller. I've looked through recent commits to identify potential causes but haven't pinpointed the issue yet.

Hope this helps. Have a nice day!

@henriquecfg

Thanks @LalosBastien

It is working with quay.io/ansible/awx-ee:23.6.0

With quay.io/ansible/awx-ee:23.7.0, the error persists.

@birb57
Author

birb57 commented Feb 16, 2024 via email

@LalosBastien

Hi @birb57,

To restart the awx-task pod, you can perform a rollout restart on the deployment using kubectl:

kubectl rollout restart deployment awx-task -n <awx_namespace>

@TheRealHaoLiu, could you please tell us where the awx-ee image is built and published? It appears that the https://github.com/ansible/awx-ee repository is no longer active.

Additionally, is it possible to use the awx image in place of awx-ee?

@kurokobo
Contributor

Could anyone here who is facing this issue share your Administration > Topology View screen and receptor logs from both control nodes and execution nodes?

@kurokobo
Contributor

kurokobo commented Feb 21, 2024

@LalosBastien

@TheRealHaoLiu, could you please tell us where the awx-ee image is built and published? It appears that the https://github.com/ansible/awx-ee repository is no longer active.

https://github.com/ansible/awx-ee is the correct repository for the awx-ee image. GitHub Actions on that repository builds and publishes the latest image every 12 hours.

Additionally, is it possible to use the awx image in place of awx-ee ?

No. awx-ee contains Receptor, Ansible Runner, and a lot of modules that are required to work as an EE, but the awx image does not; the awx image cannot act as an EE.

@notobsd2

notobsd2 commented Mar 8, 2024

It's worth noting on this issue that when attempting to reload receptor, the reload just hangs on any container that ends up being disconnected. Restarting the container as a whole does resolve the issue, though.

@reshifirdous

We get a similar issue; all we do is restart the task pods, and we are not sure what the reason for this is.

WARNING 2024/03/11 08:51:29 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2024/03/11 08:51:30 Client connected to control service @
DEBUG 2024/03/11 08:51:30 Control service closed
DEBUG 2024/03/11 08:51:30 Client disconnected from control service @
DEBUG 2024/03/11 08:51:31 Client connected to control service @
DEBUG 2024/03/11 08:51:33 Control service closed
DEBUG 2024/03/11 08:51:33 Client disconnected from control service @
DEBUG 2024/03/11 08:51:34 Sending routing update 0c1zZmPA. Connections: awx-mesh-ingress(1.00)
DEBUG 2024/03/11 08:51:36 Client connected to control service @
DEBUG 2024/03/11 08:51:37 Client connected to control service @
DEBUG 2024/03/11 08:51:38 Client connected to control service @
DEBUG 2024/03/11 08:51:38 Client connected to control service @
DEBUG 2024/03/11 08:51:38 Client connected to control service @
DEBUG 2024/03/11 08:51:38 Control service closed
DEBUG 2024/03/11 08:51:38 Client disconnected from control service @
DEBUG 2024/03/11 08:51:40 Client connected to control service @
DEBUG 2024/03/11 08:51:40 Client connected to control service @
DEBUG 2024/03/11 08:51:41 Client connected to control service @
DEBUG 2024/03/11 08:51:41 Client connected to control service @
DEBUG 2024/03/11 08:51:41 Client connected to control service @
DEBUG 2024/03/11 08:51:42 Client connected to control service @
DEBUG 2024/03/11 08:51:42 Control service closed
DEBUG 2024/03/11 08:51:42 Client disconnected from control service @
DEBUG 2024/03/11 08:51:43 Client connected to control service @
DEBUG 2024/03/11 08:51:43 Client connected to control service @
DEBUG 2024/03/11 08:51:44 Sending routing update Y5pU6KJs. Connections: awx-mesh-ingress(1.00)
DEBUG 2024/03/11 08:51:46 Client connected to control service @
DEBUG 2024/03/11 08:51:46 Control service closed
DEBUG 2024/03/11 08:51:46 Client disconnected from control service @
DEBUG 2024/03/11 08:51:50 Client connected to control service @
WARNING 2024/03/11 08:51:52 Timing out connection awx-mesh-ingress, idle for the past 21s

@isuftin

isuftin commented Mar 11, 2024

We're seeing this after upgrading to the latest Ansible Automation Platform, which also included receptor 1.4.4.

@whitej6

whitej6 commented Mar 19, 2024

I have been able to fully replicate the issue and can easily reproduce what occurs. I am in the process of doing a full write-up with all findings. (I am a consumer of the product and do not represent or have any affiliation with the Ansible GitHub org or any products under the IBM/Red Hat umbrella.)

The short version is CPU contention on the node running as a receptor execution node, combined with something that changed in receptor between the builds 1.4.3+g4ca9363 and 1.4.3+gdab86f4 used in the awx-ee image. The last image that appears to handle CPU contention well is quay.io/ansible/awx-ee:23.6.0, and it can be set on the AWX k8s resource by overriding the value of control_plane_ee_image.
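A minimal sketch of overriding that field on the AWX custom resource (the instance name awx and namespace awx below are assumptions; adjust them to your deployment):

kubectl -n awx patch awx awx --type merge \
  -p '{"spec": {"control_plane_ee_image": "quay.io/ansible/awx-ee:23.6.0"}}'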

This is due to what appears to be a flawed calculation of available resources on ALL receptor nodes, but more specifically the receptor execution node. AWX/Receptor will throw more jobs at the execution node than it can handle, a process in the awx-task container throws an exception in a thread, and it must be manually restarted to resolve the issue.

I upgraded my lab environment to AWX v24.0.0 (using all default values) and the issue persists. The above workaround resolves the issue in v24.0.0.

The better workaround I would recommend is to dial back the maximum number of forks a receptor execution instance can run (typically even the lowest value is still too high) and/or ensure the receptor execution host is running on a CPU with high clock speeds. From my testing, the AMD EPYC 7571 is NOT performant enough to keep up and will suffer from the above issue, but when I swapped the CPU out for an Intel Sapphire Rapids 8488C I was no longer able to replicate the issue.
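As an illustration of dialing capacity back, the instance capacity slider can also be lowered through the AWX API; this is a rough sketch, and the host, token, and instance id below are placeholders:

# drop the capacity_adjustment slider toward its minimum for instance id 3
curl -s -X PATCH https://awx.example.com/api/v2/instances/3/ \
  -H "Authorization: Bearer $AWX_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"capacity_adjustment": "0.1"}'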

@whitej6

whitej6 commented Mar 22, 2024

I am almost certain that the offending commit is this line, and that it's not allowing the remote receptor enough time to respond.

@whitej6

whitej6 commented Mar 22, 2024

I will try to build the latest image with that one line rolled back to cancel.

@whitej6

whitej6 commented Mar 22, 2024

Sooooo, I think the issue is fixed!!!! I did not realize before my last two comments that v1.4.5 had been released. When I went to make the change and re-compile the receptor binary, I found that line had already been reverted back to cancel. I used the following Dockerfile to build. I am throwing a LOT at my dev instance on generic compute to replicate the exact issue, and it appears to remain happy.

FROM ghcr.io/ansible/receptor:v1.4.5 as builder

FROM quay.io/ansible/awx-ee:24.0.0

COPY --from=builder /usr/bin/receptor /usr/bin/receptor
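A rough usage sketch for building and publishing that image (the registry and tag are just examples):

docker build -t docker.io/<your_user>/awx-ee:1.4.5-receptor .
docker push docker.io/<your_user>/awx-ee:1.4.5-receptor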

The image is published to docker.io/whitej6/awx-ee:1.4.5, and it appears, at least for the AWX side 🤞, that the next awx-ee rebuild will pull in the change; beyond that it becomes an issue for the AWX project as a whole, since it sources from a nightly build of another project's devel branch.

This commit appears to be the fix: c19fdcc

@fosterseth
Member

fosterseth commented Mar 25, 2024

@whitej6 we'd still be interested in your testing the newest version of quic, but with just that one line changed to use a context with cancel instead of a context with timeout.

Because as it stands we both reverted the quic version and changed that one line, so we aren't sure which change actually helped. Ideally we want to get onto the latest quic version.

If you wanted to proceed by building receptor with that change and testing, that would be valuable information for us!

@whitej6

whitej6 commented Mar 25, 2024

Go isn't a language I'm super strong in, so this may not be something I could do quickly enough. I believe I should be able to grab v1.4.4, make the change back to cancel, and build, but if you by chance have a build of the binary I am happy to use that binary in my test environment.

Also, I am finalizing the last pieces of my findings doc and waiting on approval to post. Sorry for the delay.

@fosterseth
Member

@whitej6

quay.io/fosterseth/awx-ee:ctxwithcancel

That awx-ee has the change ^
I'm not sure if you need both control_plane_ee_image and the job EE to be that image, or just the job awx-ee.

If you just want the binary, you can copy it out of that image and then copy it into your own awx-ee.
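If it helps, a minimal sketch of pulling the binary out of that image with docker create/cp (the temporary container name is arbitrary):

docker create --name receptor-src quay.io/fosterseth/awx-ee:ctxwithcancel
docker cp receptor-src:/usr/bin/receptor ./receptor
docker rm receptor-src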

@whitej6

whitej6 commented Mar 25, 2024

That should be enough. Just needs to be the control plane.

@whitej6

whitej6 commented Mar 25, 2024

@fosterseth it appears that the image provided still has the issue. My RHEL host is not behaving (it completely locked up), so I may need to redeploy that host to be sure, though.

I do think, if I interpreted the code correctly, that a 15-second timeout is likely too aggressive. It would help if you wanted to push a few different flavors of the image (e.g. invert the change, revert quic but with a 90-second timeout, etc.).

@fosterseth
Member

fosterseth commented Mar 25, 2024

here is the info for that test

quay.io/fosterseth/awx-ee:ctxwithcancel
v1.4.4 with this diff

diff --git a/pkg/netceptor/conn.go b/pkg/netceptor/conn.go
index 4784684..c2ccaaa 100644
--- a/pkg/netceptor/conn.go
+++ b/pkg/netceptor/conn.go
@@ -305,7 +305,7 @@ func (s *Netceptor) DialContext(ctx context.Context, node string, service string
                        _ = pc.Close()
                })
        }
-       cctx, ccancel := context.WithTimeout(ctx, 15*time.Second)
+       cctx, ccancel := context.WithCancel(ctx)
        go func() {
                select {
                case <-okChan:

@whitej6

whitej6 commented Mar 25, 2024

Rerunning the test right now with the same image. I am monitoring CPU/MEM on the remote receptor execution node while it runs. So far the second test has been happy. I did not go in to check the state of the remote receptor execution node before the first test; there may have been some lingering items that weren't cleaned up properly. I should have results from the second test in about 5 minutes. If the second test works fine, I will double the number of jobs I am throwing at it to see if I can break it, TCP-windowing style 😅

@fosterseth
Member

Okay, good to know. Here is another build that HAS the timeout, but set to 90 seconds:

quay.io/fosterseth/awx-ee:ctxwithtimeout90

v1.4.4 with this diff

diff --git a/pkg/netceptor/conn.go b/pkg/netceptor/conn.go
index 4784684..e50ce0e 100644
--- a/pkg/netceptor/conn.go
+++ b/pkg/netceptor/conn.go
@@ -305,7 +305,7 @@ func (s *Netceptor) DialContext(ctx context.Context, node string, service string
                        _ = pc.Close()
                })
        }
-       cctx, ccancel := context.WithTimeout(ctx, 15*time.Second)
+       cctx, ccancel := context.WithTimeout(ctx, 90*time.Second)
        go func() {
                select {
                case <-okChan:

@whitej6

whitej6 commented Mar 25, 2024

While I wait on the test to finish: one thing that appears to occur at the same time is batches of jobs finishing and new pods starting up. Meaning, when I throw 500 jobs at the remote receptor (max forks set to 78; I can dial that back to 8, but for testing I am leaving it at 78), AWX decides based off of some metric how many jobs it can handle at once. When I have a handful finishing at once, at times it appears it's not doing one-in-one-out queuing of jobs; sometimes I see 1 finish and 2 start, and that seems to be what finally pushes it over the edge. For my current test environment it looks like 37 concurrent jobs (of my test playbook) is the max where it stays happy. It's when it goes over the magical 37 that I am able to replicate the issue in my test environment.

Also, it appears that in mixed-node-size Kubernetes environments a compounding factor is that all ReceptorJobs are run from the awx-task node that has the MOST available forks by number, not percentage.

While I was writing this up, my second test with the first image provided finished and did not have any issues. I will make sure the host is clean and will try the 90-second timeout. If neither reproduces the issue, I will swap back to 24.0.0 to reproduce it, then clean the host again and go down the route of throwing absolutely everything I can at it for both images to see if I can break it again.

@fosterseth
Member

It's when it goes over the magical 37 that I am able to replicate in my test environment.

The job capacity (how many jobs it can run at once) is a little arbitrary -- it is based on the total RAM / CPU of that remote node.

The system should be able to tolerate a handful more jobs over 37 without issue. It could be that the margins are close on your machine, and going past 37 jobs causes things to tap out, leading to the reproduced issue.
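For context, a rough sketch of that capacity heuristic with the documented defaults (mem_mb_per_fork=100, fork_per_cpu=4; treat the exact defaults as an assumption and check the capacity docs for your version):

# example: 8 cores and 16 GB RAM on the execution node
cores=8; mem_mb=16384
echo "cpu_capacity=$(( cores * 4 ))"               # forks allowed by CPU
echo "mem_capacity=$(( (mem_mb - 2048) / 100 ))"   # forks allowed by memory
# the instance capacity then falls between the smaller and larger of the two,
# controlled by the capacity_adjustment slider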

@whitej6

whitej6 commented Mar 25, 2024

37 is just the magical number for that specific machine that gets it to the borderline of almost falling over. If I throw a different CPU at it with better clock/turbo, I can run substantially more concurrent jobs. The number of concurrent jobs isn't what is interesting; it's that if, let's say, 5 complete at once but 7 or 8 take their place, that will push it over the edge. It might be that enough available capacity is there for X seconds that it oversubscribes enough to be problematic.

Starting the timeout test now. I had back-to-back calls.

@whitej6

whitej6 commented Mar 25, 2024

From my testing, the AMD EPYC 7571 is NOT performant enough to keep up and will suffer from the above issue, but when I swapped the CPU out for an Intel Sapphire Rapids 8488C I was no longer able to replicate the issue.

Part of this goes back to how well the Python portion runs behind the scenes and how much it's impacted by the CPU clock speed itself. The Intel CPU mentioned above has shown itself to be 4X more performant on Python <3.11 in my testing of arbitrary Django code (my day job); I haven't compared performance deltas on 3.11 yet, but I wouldn't be surprised to see a 5X bump.

From my testing, the less performant CPU clock speed is a contributing factor, and I keep my test environment on under-provisioned generic compute specifically to make it "easier to break" so I can replicate the issue.

@whitej6

whitej6 commented Mar 25, 2024

The 90-second timeout image appears good as well. I'm going to redo the cancel image test again, but it looks like both are viable fixes.

@fosterseth
Member

@whitej6 excellent. thanks for reporting back your findings!

@whitej6

whitej6 commented Mar 26, 2024

@fosterseth I ran through the test again and had the same results. Cancel or a 90-second timeout appears to work without problems. I ran again with the 24.0.0 tag just to verify that it does break, and it did. If you would like me to run any additional chaos testing, let me know which image tag to test with and I am happy to do so.

@fosterseth
Member

@whitej6 thanks for the update and your work on this -- this is encouraging. The receptor devs are working on testing an upgraded quic while keeping the ctx cancel line:

#981

@whitej6

whitej6 commented Mar 27, 2024

@fosterseth heads up, I saw AWX v24.1.0 dropped yesterday using receptor v1.4.5; good news, it is good :-)

@whitej6

whitej6 commented Mar 28, 2024

@fosterseth Out of curiosity, I see that parts of AWX are now using py3.11 but the EE image is still on 3.9. From my testing, on some operations (especially if iteration is heavily used) py3.11 can be up to 20% more performant.

@fosterseth
Member

@whitej6 thanks for pointing that out, @TheRealHaoLiu is working on bumping up to 3.11

@TheRealHaoLiu
Member

Thanks for all the hard work you did, @whitej6. Closing this issue since it's resolved now.

Come hang out with us more often 😉
