dws: add negative requires hostlist for down rabbits #252

@jameshcorbett jameshcorbett commented Jan 16, 2025

Problem: Fluxion continues to have problems with rabbits on
production systems. However, users need a way for Flux to
automatically schedule around down rabbits.

As a workaround, maintain a hostlist of all nodes that are attached
to down rabbits. Add this hostlist to jobs' constraints as a negative
hostlist constraint (like --requires=-hosts:foobar[1-20]).

Problem: Fluxion continues to have problems with rabbits on
production systems. However, users need a way for Flux to
automatically schedule around down rabbits.

As a workaround, maintain a hostlist of all nodes that are attached
to down rabbits. Eventually this hostlist should be added to jobs'
constraints, as a negative hostlist constraint (like
--requires=-hosts:foobar[1-20])

@cmoussa1 cmoussa1 left a comment


LGTM! 👍 One minor clarification question

src/job-manager/plugins/dws-jobtap.c (resolved)
src/job-manager/plugins/dws-jobtap.c (outdated, resolved)
@jameshcorbett jameshcorbett force-pushed the add-requires-hostlist-for-down-rabbits branch 2 times, most recently from aa6c04d to 06431ed Compare January 17, 2025 17:22
Problem: while Fluxion scheduling of rabbits continues to have
problems in production, a stopgap is needed to prevent rabbit jobs
from running on nodes attached to down rabbits.

Make the dws-jobtap plugin add a hostlist constraint for rabbit jobs
based on the hostlist of nodes with down rabbits that coral2_dws
sends.
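
To illustrate the idea in the commit message above, here is a minimal Python sketch of how a negative hostlist constraint could be merged into a job's existing constraints. It assumes the RFC 31 constraint object form, in which `--requires=-hosts:foobar[1-20]` corresponds to a `not` operator wrapping a `hostlist` operator. The helper name `exclude_hosts` and the `pdebug` property are hypothetical, and the actual dws-jobtap plugin is written in C and may build the constraint differently.

```python
import json


def exclude_hosts(constraints, down_hostlist):
    """Return constraints that also exclude the hosts in down_hostlist.

    constraints: an existing constraint object (dict), or None.
    down_hostlist: a hostlist string such as "foobar[1-20]".
    Hypothetical helper, not part of flux-coral2.
    """
    if not down_hostlist:
        # No rabbits are down, so leave the job's constraints untouched.
        return constraints
    # Assumed RFC 31 form of a negative hostlist constraint.
    negative = {"not": [{"hostlist": [down_hostlist]}]}
    if not constraints:
        return negative
    # Keep the user's constraints and require the exclusion as well.
    return {"and": [constraints, negative]}


existing = {"properties": ["pdebug"]}
combined = exclude_hosts(existing, "foobar[1-20]")
print(json.dumps(combined))
```

Wrapping both operands in `and` leaves any constraint the user supplied intact, so the exclusion only narrows the set of eligible nodes.
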
Problem: there are no tests to ensure that the dynamic rabbit
constraints imposed by coral2-dws actually work as intended.

Add tests.
@jameshcorbett jameshcorbett force-pushed the add-requires-hostlist-for-down-rabbits branch from 06431ed to 7509ba4 Compare January 17, 2025 17:24
@mergify mergify bot merged commit e15e1da into flux-framework:master Jan 17, 2025
8 checks passed
@jameshcorbett jameshcorbett deleted the add-requires-hostlist-for-down-rabbits branch January 17, 2025 19:18
3 participants