powerman/redfishpower: avoid query of clearly missing nodes #187

chu11 · 2024-04-30T18:51:30Z

There will always be nodes missing from a cluster, such as when they are removed for maintenance. Sometimes they can be removed for a long period of time.

This means that powerman will always hit a connection timeout for those nodes, thus slowing down pm -q. This is especially true with redfishpower and libcurl which can hit their "message timeout" pretty much all of the time.

If this could be avoided, it'd be nice for usability.

ideas

Using some whatsup(1) type of service (either whatsup itself, or the pingd daemon, or hypothetically FreeIPMI's ipmidetectd, etc.), identify nodes that are clearly missing and report them as off / missing. Unlike normal whatsup(1), the timeouts could be more extreme, e.g. no pings have been received back in 15 minutes. This is to avoid the habitually bad case when hardware is clearly gone.
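A minimal sketch of that idea, assuming some external pinger (whatsup, pingd, or similar) feeds replies into a small tracker. The `PingTracker` class, its method names, and the node names are hypothetical and are not part of powerman or whatsup; only the 15 minute threshold comes from the text above.

```python
import time

# Assumption: "clearly missing" means no ping reply for 15 minutes.
MISSING_AFTER_SECONDS = 15 * 60


class PingTracker:
    """Hypothetical helper that remembers when each node last answered a ping."""

    def __init__(self, missing_after=MISSING_AFTER_SECONDS, clock=time.monotonic):
        self.missing_after = missing_after
        self.clock = clock
        self.last_seen = {}

    def watch(self, node):
        # Start the timer now, so a node that never replies eventually
        # crosses the threshold instead of staying "unknown" forever.
        self.last_seen.setdefault(node, self.clock())

    def record_reply(self, node):
        # Called by the pinger whenever a reply arrives from the node.
        self.last_seen[node] = self.clock()

    def is_clearly_missing(self, node):
        seen = self.last_seen.get(node)
        if seen is None:
            # Unknown nodes are never skipped; they get queried as usual.
            return False
        return self.clock() - seen >= self.missing_after

    def split(self, nodes):
        """Return (nodes to query, nodes to report as off / missing)."""
        query, skip = [], []
        for node in nodes:
            (skip if self.is_clearly_missing(node) else query).append(node)
        return query, skip
```

A power query would then only contact the first list returned by `split()` and report the second list as off / missing without waiting on a timeout.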


chu11 · 2024-05-01T01:25:46Z

Alternatively, redfishpower could implement "pings" in the background, similar to `ipmipower`.

chu11 · 2024-05-01T17:53:47Z

After an offline discussion with @garlick: I was so focused on doing the same thing I did in ipmipower that it took Jim to ask "why isn't libcurl just erroring out immediately with EHOSTUNREACH" or whatever.

This thread was found

curl/curl#1603

It appears that libcurl considers some errors "transient" and will retry them. AFAICT there is no workaround for this, i.e. no way to tell libcurl "don't retry on that error".

@garlick had a hacky idea that is simpler than pings .... just do a connect() before issuing the libcurl command. If the connect succeeds immediately, using libcurl is a go.

garlick · 2024-05-01T18:20:25Z

Eh, @garlick may need to review his TCP protocol.

What about just dialing this down?

https://curl.se/libcurl/c/CURLOPT_CONNECTTIMEOUT.html

chu11 · 2024-05-01T18:31:07Z

Hmmmm. My reading of the documentation suggested to me that CURLOPT_TIMEOUT overrides CURLOPT_CONNECTTIMEOUT, but on re-reading it appears that if CURLOPT_CONNECTTIMEOUT is less than CURLOPT_TIMEOUT, it will be handled separately for the connect phase.

Hmmmm, perhaps there needs to be both a --message-timeout option (from PR #186) and a --connect-timeout option.
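To illustrate the split between the two timeouts, here is a rough analogue using plain sockets. It is not libcurl and not redfishpower code; the `probe` function and its parameter names are hypothetical and only mirror the proposed options. The connect phase gets its own shorter limit, while the overall limit still bounds the whole exchange.

```python
import socket
import time


def probe(host, port, connect_timeout=5.0, message_timeout=20.0):
    """Return True if a TCP connection to host:port is set up in time.

    connect_timeout bounds only the connect phase; message_timeout bounds
    the whole operation, connect included.
    """
    deadline = time.monotonic() + message_timeout
    try:
        # The connect phase may never take longer than the overall limit.
        sock = socket.create_connection(
            (host, port), timeout=min(connect_timeout, message_timeout)
        )
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False
    with sock:
        # Whatever is left of the overall budget applies to the request
        # itself. Keep it positive: settimeout(0) would mean non-blocking.
        remaining = max(deadline - time.monotonic(), 0.001)
        sock.settimeout(remaining)
        # A real client would send its request and read the reply here.
    return True
```

With this shape, a missing node costs at most the connect timeout, while a slow but reachable node still gets the full message timeout.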

chu11 · 2024-05-02T02:53:19Z

After playing with CURLOPT_CONNECTTIMEOUT (https://github.com/chu11/powerman/tree/redfishpower_connection_timeout), it does indeed work. But it appears that a lower connect timeout does increase the odds of issues occurring. As PR #191 just noted, a connect timeout of 5 seconds leads to a slight increase in errors at scale. So I don't think this is going to be a viable option for us going forward.
