pals: make eventlog timeout configurable #254

Open · wants to merge 1 commit into master

Conversation

jameshcorbett (Member)
Problem: sometimes 10 seconds is not enough of a timeout for reading from the eventlog if Lustre is hanging.

Make the timeout configurable.

@jameshcorbett jameshcorbett requested a review from grondo January 25, 2025 00:09
grondo (Contributor) left a comment:

This looks reasonable to me!

Stepping back a bit though, I'm still not sure exactly what went wrong in the failures we saw recently. If I remember correctly how this is supposed to work, the pals shell plugin is reading the job eventlog in order to fetch the cray_port_distribution event (and only that?), which is emitted from a jobtap plugin prolog.

Since the job shells are not started until after all prolog-finish events, the cray_port_distribution event should be guaranteed to be in the eventlog before the first shell is started (modulo the batch commit interval in the job-manager), so I'm still confused about why job shells were stuck waiting for that event (and looking back at the job in question, even the job start event appears in the eventlog).

Do you recall if we ever narrowed down why the eventlog read could time out in this situation?

I guess as another comment, it would be nice to know if the eventlog watch returned any events, or if it was stuck after reading a particular event. To that end, maybe a second commit here could amend the error message in this function when it times out to include the last event (if any) that was seen in the eventlog. I guess that may help us determine if the call received any response at all (my guess is not).
