-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
context with timeout #4007
base: main
Are you sure you want to change the base?
context with timeout #4007
Conversation
@dsxing and how do you think the two timeouts will interact? |
Maybe relates to #3904 (#3904 (comment)) |
The two timeouts are set to the same value with the purpose of halting systemd operations when the original opencontainers/runc logic returns a timeout error. This is because after runc returns an error, kubelet will perform a re-sync and call opencontainers/runc again to start the task. |
I encountered a scenario where there was a memory leak in kubelet, due to systemd hanging, the kubelet logs showed a significant number of "Failed to delete cgroup paths" and "Timed out while waiting for systemd to remove..." messages. Upon analyzing the kubelet stack traces, I found that the opencontainers/runc was making calls to the go-systemd library, and there were numerous deadlocked goroutines in go-systemd. Therefore, I am wondering if it's possible to pass the timeout information from runc to go-systemd to cancel go-systemd tasks, thus avoiding the accumulation of goroutines that could lead to a memory leak in kubelet. Note that systemd cgroup v1 is being used. |
@dsxing Please signed off your commit first.
next time, please use |
Signed-off-by: dsxing <[email protected]>
ok, it's done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concerns (or, maybe, questions) here are:
- How does two 30-second timeouts interact (as they are to be hit at about the same time)?
- How does a newly added context interacts with retryOnDisconnect?
- Why do we need two timeouts? Maybe it's better to replace the old one with the one from context?
Without a good understanding of these topics I'm hesitant to approve this.
Signed-off-by: dsxing <[email protected]>
Signed-off-by: dsxing <[email protected]>
I also think we should use the same timeout parameter. Additionally, I understand that the role of retryOnDisconnect is to reconnect when the DBus connection is lost, and it's not directly related to the results of the function call, which are reported back in the 'err' parameter. When err is nil, the final determination of the call's result will be made through statusChan. |
We set a timeout of 30 seconds when doing stopUnit/startUnit, so we may need to pass the timeout args with the context