Error back channel / side channel #233

Closed
lars-t-hansen opened this issue Jan 15, 2025 · 2 comments

@lars-t-hansen
Collaborator

The normal case with errors is that sonar starts and produces output, but that some error occurs during the run that prevents it from completing or producing correct data. (The abnormal case is that sonar does not run at all; in this case, heartbeat messages will not arrive at the receiver and the receiver has the opportunity to discover this.)

In the normal error case we have a couple of options, not exclusive:

  • ignore it, hope it was transient, and if not, hope that the failure of heartbeats to arrive at the target will alert somebody
  • log to syslog, hope somebody sees it
  • log on an error side channel, so that the target can surface the problem when errors arrive

In the interest of making errors actionable I think the third option is best, and it doesn't cost us much.

We are already multiplexing two data streams on the sonar ps stream - the sample stream and the per-system load stream, with the "load" and "gpuinfo" fields - so there's no sin in adding a third, or if you like, a third field to the existing side channel. In the CSV data this would likely be an "error" field that shows up in the first record, along with "load" and "gpuinfo", except that it would also show up in heartbeat records. In JSON data it would just be an additional "error" field at the top level. The field would be absent if there are no errors to report. The payload would be a string. A complication is that we don't want to push literal newlines in either type of data, so the string should be url-encoded or base64-encoded or in some other way made safe for transmission. (Neither encoding emits " or , in its output, so both work for both CSV and JSON. Base64 adds about 33% overhead for all data. URL-encoding adds little overhead for ASCII text, which is the expected case.)
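To make the encoding trade-off concrete, here is a minimal Python sketch (not sonar's implementation) that percent-encodes an error message and compares it with base64. The helper names, the `SAFE` set, the sample message and the `hostname` key are all assumptions made for illustration; only the top-level "error" field comes from the proposal above.

```python
import base64
import json
from urllib.parse import quote, unquote

# Assumption: these characters are left unescaped because they are harmless
# inside both a CSV field and a JSON string.
SAFE = " /:"

def encode_message(message: str) -> str:
    """Percent-encode a message so it carries no newline, quote, comma or backslash."""
    return quote(message, safe=SAFE)

def decode_message(encoded: str) -> str:
    """Invert encode_message on the receiving side."""
    return unquote(encoded)

# Hypothetical error text containing the characters that cause trouble.
message = 'ps: cannot read /proc/1234/stat,\n"permission denied"'

url_encoded = encode_message(message)
b64_encoded = base64.b64encode(message.encode("utf-8")).decode("ascii")

# Hypothetical top-level record; "hostname" is a placeholder key.
record = {"hostname": "node1", "error": url_encoded}
json_line = json.dumps(record)

print(url_encoded)
print(len(message), len(url_encoded), len(b64_encoded))
```

Leaving spaces, slashes and colons unescaped keeps typical ASCII messages close to their original length, whereas base64 always grows the payload by about a third and makes it unreadable without decoding.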

Also see #201 #232.

@lars-t-hansen lars-t-hansen added the enhancement New feature or request label Jan 15, 2025
@lars-t-hansen
Collaborator Author

I guess we have to consider not just the ps command, even though that is most important, but also the sysinfo and slurm commands. For sysinfo, we're sending one JSON object (or one line of CSV) that can take a new error field. For slurm, we're currently sending multiple lines of CSV or multiple lines of JSON (once #228 lands), but here we'd probably be better off sending a single JSON object with an embedded array of job data, so that fields like the error field can be attached to the enclosing object.
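A rough Python sketch of what the single-object shape for slurm could look like, assuming hypothetical key names ("jobs", "job_id", "state") that are not taken from sonar's actual output:

```python
import json

def slurm_envelope(jobs, error=None):
    """Wrap per-job records in one enclosing object.

    The "error" key is attached to the enclosing object only when there is
    something to report, matching the "absent if no errors" rule.
    """
    envelope = {"jobs": list(jobs)}  # "jobs" is a hypothetical key name
    if error:
        envelope["error"] = error
    return json.dumps(envelope)

# Hypothetical job records, for illustration only.
jobs = [
    {"job_id": 1001, "state": "COMPLETED"},
    {"job_id": 1002, "state": "FAILED"},
]

print(slurm_envelope(jobs))
print(slurm_envelope(jobs, error="sacct timed out"))
```

With one enclosing object the receiver parses a single document and finds any error in a fixed place, instead of having to scan every line for a field that belongs to the run as a whole.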

@lars-t-hansen
Collaborator Author

We could also use this as an informational conduit, so in addition to an "error" key in the output data there could be an "info" key. This comes up in the case of the ps command, where there are informational messages (e.g. lockfile could not be deleted) and some possibly soft errors that should not cause the command to fail.
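On the receiving side, handling both keys could look roughly like the Python sketch below. It assumes JSON records with URL-encoded payloads (one of the encodings discussed earlier in this thread); the function name and the sample record are hypothetical.

```python
import json
from urllib.parse import unquote

def side_channel_messages(line):
    """Return (kind, text) pairs for any "error" or "info" keys in one record.

    Both keys are optional; a record without them yields an empty list.
    Assumption: payloads are URL-encoded strings.
    """
    record = json.loads(line)
    messages = []
    for kind in ("error", "info"):
        if kind in record:
            messages.append((kind, unquote(record[kind])))
    return messages

# Hypothetical record carrying only an informational message.
line = json.dumps({"hostname": "node1", "info": "lockfile could not be deleted%0A"})

for kind, text in side_channel_messages(line):
    print(kind, repr(text))
```

Keeping "info" separate from "error" lets the receiver alert on one and merely log the other.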

lars-t-hansen pushed a commit to lars-t-hansen/sonar that referenced this issue Jan 24, 2025
lars-t-hansen pushed a commit to lars-t-hansen/sonar that referenced this issue Jan 29, 2025
lars-t-hansen pushed a commit to lars-t-hansen/sonar that referenced this issue Jan 29, 2025
lars-t-hansen pushed a commit to lars-t-hansen/sonar that referenced this issue Jan 29, 2025
@lars-t-hansen lars-t-hansen added this to the v0.13 milestone Jan 29, 2025
@bast bast closed this as completed in 383a654 Jan 29, 2025
bast added a commit that referenced this issue Jan 29, 2025