powerman/redfishpower: avoid query of clearly missing nodes #187

Open
chu11 opened this issue Apr 30, 2024 · 5 comments

@chu11
Member

chu11 commented Apr 30, 2024

There will always be nodes missing from a cluster, such as when they are removed for maintenance. Sometimes they can be removed for a long period of time.

This means that powerman will always hit a connection timeout for those nodes, slowing down pm -q. This is especially true with redfishpower and libcurl, which hit their "message timeout" pretty much every time a node is missing.

If this could be avoided, it'd be nice for usability.

ideas

Using some whatsup(1) type of service (whatsup itself, the pingd daemon, or hypothetically FreeIPMI's ipmidetectd, etc.), identify nodes that are clearly missing and report them as off / missing. Unlike normal whatsup(1), the timeouts could be more extreme, e.g. no pings received back in 15 minutes. This is to avoid the habitually bad case when hardware is clearly gone.
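A minimal sketch of the "clearly missing" test described above, assuming some detection service records the time of the last ping reply per node. The struct, function name, and threshold macro are hypothetical and not part of powerman or whatsup.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical threshold: 15 minutes without a ping reply. */
#define MISSING_THRESHOLD_SEC (15 * 60)

/* Hypothetical per-node record kept by the detection service. */
struct node_status {
    const char *name;
    time_t last_reply;   /* time of last ping reply (or when tracking began) */
};

/* Return true if the node has been silent for at least the threshold,
 * in which case a power query could report it as missing without
 * contacting it at all. */
bool node_clearly_missing(const struct node_status *n, time_t now)
{
    return difftime(now, n->last_reply) >= MISSING_THRESHOLD_SEC;
}
```

A node that fails this test would still be queried normally, so a short outage never hides a node that is merely slow.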

@chu11
Member Author

chu11 commented May 1, 2024

Alternately, redfishpower could implement "pings" in the background, similar to `ipmipower`.

@chu11
Member Author

chu11 commented May 1, 2024

Offline discussion with @garlick: I was so focused on doing the same thing that I did in ipmipower that it took Jim to ask "why isn't libcurl just erroring out immediately with EHOSTUNREACH" or whatever.

This thread was found

curl/curl#1603

It appears that libcurl considers some errors "transient" and will retry them. AFAICT there is no workaround to tell libcurl "don't retry on that error".

@garlick had a hacky idea that is simpler than pings .... just do a connect() before issuing the libcurl command. If the connect succeeds immediately, using libcurl is a go.
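A rough sketch of what such a pre-flight check might look like, using only POSIX sockets. The function name `probe_connect` and its parameters are hypothetical. Note that a plain blocking connect() to an absent host can itself wait a long time, so this version uses a non-blocking socket with its own deadline, which makes it behave much like a connect timeout.

```c
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return 0 if a TCP connection to host:port
 * completes within timeout_ms, -1 otherwise.
 * Caveat: getaddrinfo() may block on DNS for non-numeric hosts. */
int probe_connect(const char *host, int port, int timeout_ms)
{
    struct addrinfo hints, *res, *ai;
    char portstr[16];
    int rc = -1;

    snprintf(portstr, sizeof(portstr), "%d", port);
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, portstr, &hints, &res) != 0)
        return -1;

    /* Try each resolved address until one connects. */
    for (ai = res; ai != NULL && rc != 0; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags >= 0 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) == 0) {
            if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
                rc = 0;                       /* connected immediately */
            } else if (errno == EINPROGRESS) {
                struct pollfd pfd = { .fd = fd, .events = POLLOUT };
                int err = 0;
                socklen_t len = sizeof(err);
                /* Wait for the handshake, then read the final status. */
                if (poll(&pfd, 1, timeout_ms) == 1
                    && getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0
                    && err == 0)
                    rc = 0;
            }
        }
        close(fd);
    }
    freeaddrinfo(res);
    return rc;
}
```

A caller would run something like `probe_connect(hostname, 443, 1000)` before handing the request to libcurl, and report the node as unreachable on -1.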

@garlick
Member

garlick commented May 1, 2024

Eh, @garlick may need to review his TCP protocol.

What about just dialing this down?

https://curl.se/libcurl/c/CURLOPT_CONNECTTIMEOUT.html

@chu11
Member Author

chu11 commented May 1, 2024

Hmmmm. My first reading of the documentation suggested that CURLOPT_TIMEOUT overrides CURLOPT_CONNECTTIMEOUT, but on re-reading it appears that if CURLOPT_CONNECTTIMEOUT is less than CURLOPT_TIMEOUT, it is handled separately for the connect phase.

Hmmmm, perhaps there need to be both a --message-timeout option (from PR #186) and a --connect-timeout option.

@chu11
Member Author

chu11 commented May 2, 2024

After playing with CURLOPT_CONNECTTIMEOUT (https://github.com/chu11/powerman/tree/redfishpower_connection_timeout), it does indeed work. But it appears that a lower connect timeout does increase the odds of issues occurring. As PR #191 just noted, a connect timeout of 5 seconds leads to a slight increase in errors at scale. So I don't think this is going to be a viable option for us going forward.
