idea: add FLUX environment variable that holds "closest enclosing jobid" #6474

Closed
grondo opened this issue Dec 5, 2024 · 11 comments

Comments

@grondo
Contributor

grondo commented Dec 5, 2024

This idea was brought up by @trws on slack and in a project meeting.

Flux currently doesn't have a consistent way to determine the "nearest" enclosing jobid. The cases are somewhat delineated in #3817, though the information there may be outdated (e.g. a flux_job_timeleft(3) function now exists). While it makes sense that FLUX_JOB_ID is set in the environment of tasks launched by flux run and flux submit but not in the environment of the initial program in flux alloc and flux batch, this will likely continue to cause confusion and annoyance for users.

One idea put forward by @trws is to add another FLUX_ jobid variable that is always set by the job shell which is not cleared. (Please correct me if I'm mistaken). This variable would leak through to initial programs, which would then be able to use this variable to determine the jobid of their parent instance (if there was a jobid associated) - equivalent to, but more straightforward than using flux getattr jobid. It would also be available in flux run and flux submit where it would be the same as FLUX_JOB_ID. Comparing the two variables would allow users to easily determine if they are in an initial program environment or within a job. Lack of this new environment variable would indicate that there is no enclosing job, i.e. the current process is not within an instance, or the enclosing instance is not itself a job.

Edit: if we enable this feature, that may allow us to close #3817.

@chu11
Member

chu11 commented Dec 9, 2024

A random idea I thought of: if we want to avoid spreading too many environment variables, could we support a new command, hypothetically flux job whoami (i.e. like flux job last)?

@grondo
Contributor Author

grondo commented Dec 9, 2024

That's a pretty good idea. I wonder if that would satisfy @vsoch and @trws (I think it probably would, but they should weigh in as well)?

@garlick
Member

garlick commented Dec 9, 2024

Would flux job whoami just be an alias for flux getattr jobid then?

@grondo
Contributor Author

grondo commented Dec 9, 2024

I think if FLUX_JOB_ID is set, it would return that, otherwise flux getattr jobid. If that failed, then it would return empty?
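The fallback order described here could be sketched as a small shell function. This is only an illustration: `whoami_jobid` is a made-up name, not an existing Flux command, and it assumes `flux getattr jobid` exits non-zero when there is no jobid attribute (or no instance to connect to).

```sh
# Hypothetical helper (not part of flux-core): print the nearest jobid,
# or print nothing if none can be determined.
whoami_jobid() {
    if [ -n "${FLUX_JOB_ID:-}" ]; then
        # Task launched by flux run / flux submit: the job shell set FLUX_JOB_ID.
        printf '%s\n' "$FLUX_JOB_ID"
    else
        # Initial program of flux alloc / flux batch: fall back to the
        # instance's jobid attribute. Assumption: getattr fails when the
        # attribute is unset, in which case the output stays empty.
        flux getattr jobid 2>/dev/null || true
    fi
}
```

Calling `whoami_jobid` from a job script would then give the same answer whether the script runs as a task or as an initial program.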

@vsoch
Member

vsoch commented Dec 9, 2024

An environment variable or flux attribute (each of which works consistently across cases) would be great. Another thing we would like to have is the equivalent, but for the very top-level instance id, e.g.:

FLUX_TOP_LEVEL_ID=xxx

I am adding a command, flux usernetes top-level, for our usernetes case. It walks up the chain of flux parent-uri until it reaches the top-level parent to interact with. If that could be orchestrated more simply via a passed envar (avoiding all the operations to get the flux-uris of the parents), that might be a better solution? The use case is prolog/epilog: when we need to store metadata across all levels of some root instance, it makes sense to put it at the top level for everyone to find.
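For readers unfamiliar with the parent-uri walk, a rough sketch of the idea follows. It is not the actual flux usernetes implementation: `top_level_uri` is a hypothetical name, and the sketch assumes the `local-uri` and `parent-uri` broker attributes are readable with `flux getattr`, and that `parent-uri` is unset (so `getattr` fails) in the top-level instance.

```sh
# Hypothetical helper: follow parent-uri upward until an instance has no parent.
top_level_uri() {
    # Start from the instance we are currently connected to.
    uri=$(flux getattr local-uri 2>/dev/null) || return 1
    # Ask each instance in turn for its parent; stop when the lookup fails
    # or returns nothing (assumed to mean we reached the top level).
    while parent=$(FLUX_URI="$uri" flux getattr parent-uri 2>/dev/null) \
            && [ -n "$parent" ]; do
        uri=$parent
    done
    printf '%s\n' "$uri"
}
```

Each level costs one extra `flux getattr` invocation, which is the overhead a passed-down environment variable would avoid.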

I think whatever you decide to come up with will be hugely helpful, so thank you in advance!

@chu11
Member

chu11 commented Dec 9, 2024

> I think if FLUX_JOB_ID is set, it would return that, otherwise flux getattr jobid. If that failed, then it would return empty?

Yeah, this is what I was thinking. Just wrap the logic into it.

@trws
Member

trws commented Dec 31, 2024

That would be fine I think; it's consistency that matters most. Having it be a command might be best, since that means things like flux toplevel ... would work, or one for parent or whatever if we add that; it's more composable that way.

I admit to a personal preference for making access to at least the innermost job ID and matching Flux URI very easy, though, since that's what people are most used to from other systems and what they will need for naive ports of job scripts that use the enclosing jobid and talk to the system scheduler in batch scripts.

@MrBurmark

MrBurmark commented Jan 15, 2025

Having it as an environment variable is more consistent with the behavior of Slurm, where there is an environment variable for the id of the allocation and one for the id of the subjob inside the allocation. It also saves me from having to do something like execve inside of my process.

@grondo
Contributor Author

grondo commented Jan 16, 2025

Good point about the extra pain of requiring an execve (or using the Flux C API) to get the jobid attribute.
We do try to avoid the proliferation of environment variables (a possible anti-pattern from Slurm).

We could do something simple like FLUX_ENCLOSING_ID, which would contain the closest enclosing jobid. This would be the same as FLUX_JOB_ID for jobs run directly in a top-level instance. In an initial program (what we call the batch script or program spawned by flux alloc) it would be set to the jobid attribute. Jobs run in the instance would just inherit this variable from the environment.

This captures only the enclosing jobid. If flux alloc or flux batch was run within a batch or alloc job, then this variable would be overwritten with the new batch/alloc jobid. This satisfies the original use case in this issue, but some cases proposed above also want a "top level" jobid and URI. I wonder if we should take a simple approach for now to close this issue, and open a separate issue for the more complex use case of accessing the URI and jobid at any level in a hierarchy of jobs.
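To make the cases concrete, here is a sketch of how a script might tell them apart by comparing FLUX_JOB_ID with FLUX_ENCLOSING_ID. It follows the semantics as proposed in this comment only, which may not match what ends up being implemented; `flux_context` and the strings it prints are hypothetical.

```sh
# Hypothetical helper: classify the calling context under the semantics
# proposed in this comment (not necessarily what flux-core implements).
flux_context() {
    if [ -z "${FLUX_ENCLOSING_ID:-}" ]; then
        # Proposed: the variable is absent when there is no enclosing job.
        echo "no-enclosing-job"
    elif [ -z "${FLUX_JOB_ID:-}" ]; then
        # FLUX_JOB_ID is not set in an initial program, so only the
        # enclosing id is present.
        echo "initial-program"
    elif [ "$FLUX_JOB_ID" = "$FLUX_ENCLOSING_ID" ]; then
        # Proposed: identical values for jobs run directly in a
        # top-level instance.
        echo "job-in-top-level-instance"
    else
        # Task of a job that inherited the id of its batch/alloc job.
        echo "job-in-nested-instance"
    fi
}
```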

@grondo
Contributor Author

grondo commented Jan 16, 2025

Oh, @MrBurmark, it just occurred to me that as of flux-core v0.70.0 you can request that an environment variable containing the batch jobid be set, using template substitution, e.g.:

$ flux batch -N1 --env=FLUX_BATCH_ID={{id}} ...
ƒV2VX34zvb

This would result in FLUX_BATCH_ID=ƒV2VX34zvb being set in your batch environment. Be aware that it will be propagated to all jobs (including batch/alloc jobs) run within that batch job.

@MrBurmark

MrBurmark commented Jan 17, 2025 via email

grondo added a commit to grondo/flux-core that referenced this issue Jan 17, 2025
Problem: It is inconvenient to get the jobid of the closest enclosing
instance (e.g. the ID of a batch or alloc job in the parent instance)
because the broker drops FLUX_JOB_ID in favor of a jobid attribute.
However, a case has been made that an environment variable would be
more convenient since it can be accessed without use of system(3)
or the Flux API.

Introduce FLUX_ENCLOSING_ID, which is set by the broker whenever the
jobid attribute is set (i.e. when the broker is started as a job in
a Flux instance). This will be available in the initial program as
well as being inherited by jobs run within the instance.

Add the variable to the env_blocklist so that it is unset when the
current instance is not a job.

Fixes flux-framework#6474
@mergify mergify bot closed this as completed in e704384 Jan 17, 2025