powerman/redfishpower: avoid query of clearly missing nodes #187
Comments
alternately, |
Offline discussion with @garlick: I was so focused on doing the same thing that I did in ipmipower that Jim asked "why isn't libcurl just erroring out immediately with EHOSTUNREACH" or whatever. A thread was found, and it appears that libcurl considers some errors "transient" and will retry them. AFAICT there is no workaround to tell libcurl "don't retry on that error". @garlick had a hacky idea that is simpler than pings .... just do a |
Eh, @garlick may need to review his TCP protocol. What about just dialing this down? |
Hmmmm. My reading of the documentation suggested to me that CURLOPT_TIMEOUT overrides CURLOPT_CONNECTTIMEOUT, but re-reading it again it appears if CURLOPT_CONNECTTIMEOUT is less than CURLOPT_TIMEOUT, it will be handled separately for the connect phase. hmmmm, perhaps there needs to be a |
After playing with CURLOPT_CONNECTTIMEOUT (https://github.com/chu11/powerman/tree/redfishpower_connection_timeout), it does indeed work. But it appears that a lower connect timeout does increase the odds of issues occurring. As PR #191 just noted, a connect timeout of 5 seconds leads to a slight increase in errors at scale. So I don't think this is going to be a viable option for us going forward. |
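To make the connect-phase idea above concrete, here is a minimal sketch using Python's standard socket module rather than libcurl, so it is only an analogue of what CURLOPT_CONNECTTIMEOUT does. The function name and the default values are assumptions for illustration and are not taken from redfishpower. Note that the second timeout here applies per socket operation, whereas CURLOPT_TIMEOUT caps the whole transfer.

```python
import socket


def connect_with_short_timeout(host, port, connect_timeout=5.0, io_timeout=20.0):
    """Hypothetical helper: bound the connect phase separately from later I/O.

    Returns a connected socket, or None if the connect phase fails or
    times out (for example, because the node is not on the network).
    """
    try:
        # The timeout given here covers only the TCP connect attempt.
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return None
    # Once connected, allow a longer timeout for each send/recv call.
    sock.settimeout(io_timeout)
    return sock
```

As the comment above notes, a short connect timeout gives up on missing nodes sooner but can also give up on slow, healthy ones, so any value chosen would need testing at scale.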
There will always be nodes missing from a cluster, such as when they are removed for maintenance. Sometimes they can be removed for a long period of time.
This means that powerman will always hit a connection timeout for those nodes, thus slowing down pm -q. This is especially true with redfishpower and libcurl, which can hit their "message timeout" pretty much all of the time.
If this could be avoided, it'd be nice for usability.
ideas
Using some whatsup(1) type of service (either whatsup itself, or the pingd daemon, or hypothetically FreeIPMI's ipmidetectd, etc.), identify nodes that are clearly missing and report that they are off / missing. Unlike normal whatsup(1), the timeouts could be more extreme, like if no pings have been received back in 15 minutes. This is to avoid the habitually bad case when hardware is clearly gone.
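The idea above could be sketched roughly as follows: keep a last-seen timestamp per node and skip querying any node that has been silent for longer than a generous threshold. This is only an illustration. The class name, method names, and the injected clock are hypothetical, and the 15-minute value is the example figure from the text, not a tested setting. How the ping replies would actually be collected (whatsup, pingd, ipmidetectd) is left out.

```python
import time

# Assumption: 15 minutes, the example threshold mentioned in the text.
MISSING_AFTER_SECONDS = 15 * 60


class LastSeenTracker:
    """Hypothetical helper: remember when each node last answered a ping."""

    def __init__(self, missing_after=MISSING_AFTER_SECONDS, clock=time.monotonic):
        self.missing_after = missing_after
        self.clock = clock  # injectable so the logic can be tested
        self.last_seen = {}

    def track(self, node):
        """Start tracking a node; its grace period runs from now."""
        self.last_seen.setdefault(node, self.clock())

    def record_reply(self, node):
        """Note that a ping reply from this node just arrived."""
        self.last_seen[node] = self.clock()

    def is_clearly_missing(self, node):
        seen = self.last_seen.get(node)
        if seen is None:
            # Never tracked: no evidence either way, so query it normally.
            return False
        return self.clock() - seen >= self.missing_after

    def partition(self, nodes):
        """Split nodes into (to_query, clearly_missing), preserving order."""
        to_query, missing = [], []
        for node in nodes:
            (missing if self.is_clearly_missing(node) else to_query).append(node)
        return to_query, missing
```

A caller could then report the second list as off / missing straight away and send real power queries only to the first list, so the connection timeout is paid only for nodes that might answer.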