broker: enable brokers to be added to running instances #5184

Open

wants to merge 8 commits into master from the flub branch

Conversation

@garlick (Member) commented May 19, 2023

This adds the ability to add brokers to a running instance that was bootstrapped with PMI.

There are two pieces:

  1. -Ssize=N can be specified on the broker command line when the broker is bootstrapped via PMI. The PMI exchange occurs over the size set by the PMI process manager as usual, and the additional ranks are just left offline.

  2. The FLUB (FLUx Bootstrap) boot method is added. A broker can be started with -Sbroker.boot-server=URI with URI set to the flux-uri(1) URI of a running instance. It will perform a bootstrap operation with the instance and join as one of the extra ranks. (A minimal invocation sketch follows this list.)
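
For concreteness, a minimal sketch of the two pieces (the sizes and the URI below are illustrative placeholders; the first command is assumed to be launched under a PMI process manager with, say, 4 tasks):

    # Piece 1: PMI bootstrap over the 4 PMI-provided ranks, with the
    # instance sized for 8; ranks 4-7 simply start out offline.
    flux broker -Ssize=8

    # Piece 2: a new broker joins the running instance via FLUB, taking
    # one of the extra ranks (the URI is the instance's flux-uri(1) URI).
    flux broker -Sbroker.boot-server=ssh://host0/tmp/flux-12345/local-0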

A big caveat is of course that instance resource configurations are static, so we can't yet combine R handed down from the enclosing instance for the PMI portion of the instance with dynamically probed R from the flubbed-in parts. Dummy resources for the maximum instance size have to be configured from the start.

However this is neat: you can start a batch job of N brokers but with an instance size of N+M, then later you can submit a job that requests M nodes and starts brokers that wire in to the other instance. This is demonstrated in one of the new sharness tests.

There are some other caveats documented in commit messages that could be addressed here or later. For example, only leaf nodes in the topology can be added currently. That could be trivially fixed with the addition of more code. And it's likely a small amount of work to extend this so that non-critical brokers that die in a regular flux instance could be replaced with new ones.

Fixed alerts: src/broker/boot_pmi.c, src/broker/boot_flub.c (2).
@grondo (Contributor) commented May 19, 2023

This is groundbreaking, awesome!

Dummy resources for the maximum instance size have to be configured from the start.

Any sense of how this can be done in practice when you don't know what resources will be assigned for the flubbed in ranks? Or is this mainly meant for the case of growing an instance onto known resources?

@garlick (Member Author) commented May 19, 2023

Any sense of how this can be done in practice when you don't know what resources will be assigned for the flubbed in ranks? Or is this mainly meant for the case of growing an instance onto known resources?

I guess we'll want to have some way to update the instance's R and inform the scheduler, but that's potentially a big change affecting the resource acquisition protocol. So...more thought required? I was thinking of this proposal as adding flexibility to the plumbing, and hoped we might get some collective insight into next steps while playing with it.

But maybe it could be useful on homogeneous clusters where a node is a node is a node?

@grondo (Contributor) commented May 19, 2023

But maybe it could be useful on homogeneous clusters where a node is a node is a node?

Unfortunately, I think the hostname will even have to match, though potentially updating a hostname entry from extra1 to the real hostname in R would be easier than growing R and notifying all components that use it.

@grondo (Contributor) commented May 19, 2023

I wonder if a potential stunt would be to have a job launch with size=sizeof(cluster) and copy the cluster's R; then when nodes are flubbed in, the correct rank could be assigned by hostname. Hm, except the initially allocated node would always have to be rank 0. You are correct, more thought required. (I was mainly thinking of how we could quickly proof-of-concept a growing job on a real cluster.)

@grondo (Contributor) commented May 19, 2023

If we could get the resource module to update its version of R, then we could sidestep the resource acquisition protocol for now by reloading the scheduler. Then we could tackle the issues in a couple phases.

@garlick (Member Author) commented May 19, 2023

If we could get the resource module to update its version of R, then we could sidestep the resource acquisition protocol for now by reloading the scheduler. Then we could tackle the issues in a couple phases.

Ooh! I hadn't thought of that. For a DAT maybe you could start the instance on one node with no scheduler loaded to accept job submissions, then add the rest of the nodes and load the scheduler?

Edit: oh except feasibility check wouldn't work.

@grondo (Contributor) commented May 19, 2023

Edit: oh except feasibility check wouldn't work.

It would be easy to write a more generic feasibility validator plugin that rejects obviously infeasible jobs in this case, if that would be useful.

@grondo (Contributor) commented May 19, 2023

Actually, in the case of a DAT the eventual resources are probably already known, so maybe this is a non-issue for that use case.

@garlick (Member Author) commented May 19, 2023

Rebased and repushed with fixups squashed, some test improvements, and added code to allow non-leaf nodes to be flubbed and flubbed nodes to start in any order. So for example you could start 1 node in a kary:2 topology, then flub in 1023 nodes and it should wire up! The unit tests do this with 1+7 nodes, and that's as far as I've tried this scaling-wise.

@garlick (Member Author) commented May 19, 2023

Actually, in the case of a DAT the eventual resources are probably already known, so maybe this is a non-issue for that use case.

Hey for the DAT case, could we remove resource.R from the KVS, then reload the resource module across the (full) instance and expect it to build a new resource.R dynamically?

@grondo (Contributor) commented May 19, 2023

Hey for the DAT case, could we remove resource.R from the KVS, then reload the resource module across the (full) instance and expect it to build a new resource.R dynamically?

It doesn't seem like this is necessary for a DAT where a known set of resources is being reserved for a future time period? You can just grab the R from the reservation or whatever and use that as the resource config, maybe. The part I'm not clear on is how and where the initial instance is launched and managed?

Restarting the resource module and its dependencies sounds like an interesting experiment to try to grow a job, i.e. start a job of size=4 on one node, then submit another job that is flubbed into that one. Eventually it would be nice if the R of the new job could just be merged with the R of the existing instance.
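
A rough sketch of that experiment (none of this is implemented by this PR; the module and key names below are the stock ones mentioned in this thread, and ordering relative to the scheduler's resource acquisition is hand-waved):

    # inside the grown instance: drop the stale resource set from the KVS
    flux kvs unlink resource.R

    # reload the resource module on all ranks so it re-probes resources
    # across the now-larger instance
    flux exec -r all flux module reload resource

    # reload the scheduler so it picks up the new R
    flux module reload sched-simple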

@garlick (Member Author) commented May 19, 2023

Another update which allows a lost node from a "normal" instance to be replaced with flub (and a test).

Inline review thread on the FLUB boot code:

    * is used (system instance).
    */
    (void)attr_get (ctx->attrs, "broker.boot-server", &uri, NULL);
    if (!(h = flux_open_ex (uri, 0, error)))
@grondo (Contributor):

Any reason not to call uri_resolve(3) here? Then a JOBID or other resolvable URI could be provided directly to the broker.boot-server attribute.

Although, if the intent is to add a higher level command like flux expand OPTIONS... JOBID, perhaps that's unnecessary.

@garlick (Member Author):

It seems like this might be helpful. I'm not sure what the next steps are for tooling so may as well add that for now.

@garlick (Member Author):

A complication is that uri_resolve(3) requires FLUX_URI to be set, and the broker unsets that. I could setenv FLUX_URI to the value of the parent-uri attribute around the call but since the environment is global...eww? Maybe we should table this one for now.

@grondo (Contributor):

Hm, yeah, that's a bummer. It might be useful in the long term in case you target the URI of a job that hasn't started yet, but doesn't seem like a big deal for now.

Maybe we need a version of uri_resolve(3) that takes a uri argument and (somehow) passes that down to flux uri if that command is invoked. Meh, probably too soon to worry about it.

@garlick changed the title from "WIP broker: enable brokers to be added to running instances" to "broker: enable brokers to be added to running instances" on May 19, 2023
@garlick (Member Author) commented May 20, 2023

Another update which allows a lost node from a "normal" instance to be replaced with flub (and a test).

Whoops, I realized I forgot that if a failed broker returns to service via some other means (like systemd restarts it) then its rank would still be provisioned for flub booting. Fixed that and forced a push.

Edit: better hold off reviewing this, I'm seeing some intermittent test failures that need to be run down.

@garlick (Member Author) commented May 20, 2023

The failure I observed in the t0034-flub.t sharness test was, in the last test, a "connection timed out" error from flux overlay status while waiting for the instance to become full again. I really don't quite understand what caused that and I've been unable to reproduce it since. I did push some changes to the tests to reduce redundancy and to use the flux uri --local URI instead of the ssh:// one. Anyway, maybe everything is fine. 🤷

@garlick force-pushed the flub branch 2 times, most recently from 5b5d65a to e47d35d on May 22, 2023 at 12:58
@garlick (Member Author) commented May 22, 2023

I woke up this morning and pushed a fix to the test, then realized I actually broke the test, and repushed. Then got coffee. Sorry for the noise.

The inception builder failed with this, so I restarted it:

2023-05-22T14:00:50.2605129Z ./test_iostress.t timed out after 300.0s
2023-05-22T14:00:50.2605664Z ERROR: test_iostress.t - missing test plan
2023-05-22T14:00:50.2606063Z ERROR: test_iostress.t - exited with status 137 (terminated by signal 9?)

@vsoch (Member) commented May 27, 2023

Testing this out - is there any reason this couldn't be supported for more than one external broker uri? E.g., multiple -Sbroker.boot-server=URI options, or a comma-separated list for that one attribute: -Sbroker.boot-server=URIA,URIB,...,URIN. Oh no, don't read that last one out loud... 😆

@vsoch (Member) commented May 27, 2023

And a second question: how would -Ssize=N work with the current hack of telling Flux it has follower brokers that don't actually exist? Can I do both, and assume that size=N refers to the external brokers, while the additional brokers I add to the broker config are expected to be local brokers (in the first created cluster)?

@garlick (Member Author) commented May 27, 2023

is there any reason this couldn't be supported for more than one external broker uri?

That could be added if needed. Another related thing that could be added is some retry policy/logic in case the initial connect fails. I didn't see a need(?) so I kept it simple for now.

how would -Ssize=N work with the current hack of telling Flux it has follower brokers that don't actually exist? Can I do both, and assume that size=N refers to the external brokers, while the additional brokers I add to the broker config are expected to be local brokers (in the first created cluster)?

It would not work currently. The -Ssize=N option is only accepted for a PMI-bootstrapped instance, where the "bootstrap size" is provided by the PMI server and a way to override it is needed. When Flux bootstraps from a config file, the bootstrap size is inferred from the config file's hosts array, so my assumption was that method doesn't need an override.

@vsoch (Member) commented May 27, 2023

It would not work currently. The -Ssize=N option is only accepted for a PMI-bootstrapped instance, where the "bootstrap size" is provided by the PMI server and a way to override it is needed. When Flux bootstraps from a config file, the bootstrap size is inferred from the config file's hosts array, so my assumption was that method doesn't need an override.

Could this be why my prototype sees the other hosts as down? I'm not sure what PMI bootstrapped means, but I'm not sure I've done that. What should I try to see the hosts as up?

@garlick (Member Author) commented May 27, 2023

I'm mostly trying to understand how it works, and demonstrate that in the Flux operator. I don't have a hard use case in mind beyond what you have in mind for it (which I'm trying to figure out!)

The main use case I had in mind was growing batch/alloc jobs.

You could start a flux instance of size N with flux alloc --nodes=N --broker-opts=-Ssize=N+M. Then later on you could start M individual brokers with flux submit --nodes=M flux broker -Ssize=N+M -Sbroker.boot-server=$(flux uri --remote ID), where ID is the job ID of the alloc job. Now your allocation has all N+M nodes wired up; see the sketch below. (There are other issues mentioned above with resources, but the lower-level stuff should be working.)
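
Spelled out as a shell sketch with example numbers (N=4 and M=2 are arbitrary placeholders, JOBID stands for the alloc job's ID, and the resource caveats above still apply):

    # start an allocation of 4 brokers, with the instance sized for 6
    flux alloc --nodes=4 --broker-opts=-Ssize=6

    # later, from the enclosing instance, wire 2 more brokers into it
    flux submit --nodes=2 flux broker -Ssize=6 \
        -Sbroker.boot-server=$(flux uri --remote JOBID)

    # inside the alloc instance, check that all 6 ranks are wired up
    flux overlay status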

As to how it works: first the flux instance bootstraps with PMI and the extra size argument. The instance creates placeholder hostnames for the extra ranks and provisions the FLUB rank allocator. Then each new broker independently executes the following sequence:

  • contact the boot server and request one of the extra ranks and the TBON parent identity
  • contact the parent and send public key from dynamically generated certificate; receive public key of parent
  • wireup proceeds as usual

I don't know if that helps. I'm not picturing an application for this with config files, but I might just be thinking about it too narrowly. However, you have made me aware that at minimum the error checking here is a bit too loosey-goosey.

@vsoch (Member) commented May 27, 2023

Gotcha. So what I should switch to trying is bringing up one MiniCluster and then adding ranks to it. I think the problem there (in the context of the operator) is that making that request from within the cluster doesn't interact with the operator, so you might start a size-N cluster and then ask for +M, but you'd need to ask the operator to increase too. With our current hack, which just arbitrarily allows extra nodes to join/leave up to some number, we can let the application (internally) or an external user make that decision, or use a horizontal pod autoscaler to do it based on a metric. I want to talk about two things here: the best example use case for the operator in this PR, and then the ideal development for Flux that we would want (not scoped to this PR, so I could open another issue).

Using current FLUB in the operator

As I mentioned above, I can refactor the example to just scale a single MiniCluster up/down. I don't think this gains us that much, because it seems (mostly) equivalent to our current hack, which is a bit better suited to the automation we currently have for autoscaling. But I think my main question is: why can't this external boot server be in another MiniCluster? This is the interesting use case for me. I have two MiniClusters, each with some pods on the same network; why can't I recruit them? This is the use case I'd like to talk through, because it would mean I can combine workers from different MiniClusters (which is cool and something we don't currently support). Another idea to test there would be to use a simple (albeit manual) flux proxy.

Autoscaling Flux based on what Flux needs.

We currently have demos for autoscaling Flux internally (requested by the application), externally (requested by the user), and automatically based on resources (horizontal pod autoscaling). However, for the last one I've only tried autoscaling/v1, which scales based on CPU %, and I haven't tried autoscaling/v2, which allows custom metrics. What we could do here is define a custom metric that comes directly from Flux! E.g., ideally I can imagine:

  • FLUB allows flux to grant more workers than it has (and our current hack allows this too, it doesn't matter which we use)
  • When a user requests +M via a submit, there is a metric that Flux can send out to say "I need more nodes!"
  • The autoscaler can hear / detect this signal, and allocate the nodes (up to some limit)
  • The submitted job turns into a running job when the resources are available
  • The same signal is used to indicate "I don't need this many nodes anymore!" and it scales back down.

I haven't tested this yet, but I think:

  • autoscaling/v2 adds support for CPU plus memory and, importantly, custom metrics!
  • custom metrics come from yet another Kubernetes API that collects metrics (even something simple like "Flux has this many nodes but needs this many") and delivers them to the autoscaler. The autoscaler scales up, the MiniCluster size increases, and we have the larger (or smaller) cluster.

Doing some digging, a common way to get metrics from pods is using Prometheus, and actually it's so popular that an adapter already exists. So what would that mean? It would mean that we need to find a way for Flux to export Prometheus metrics (maybe a plugin? There do seem to be libraries, e.g., https://github.com/jupp0r/prometheus-cpp) and then have the Prometheus adapter consume that, register it as a metric in the horizontal pod autoscaler, and boom, it works. Then we would have the scaling of the cluster determined not by CPU/memory or a manual request by a user or application, but by the need for resources in the queue. TL;DR:

  1. Write a Flux plugin that exports Prometheus metrics
  2. Combine that with the Flux Operator + horizontal pod autoscaler with a custom metric (autoscaling/v2)
  3. Install the HPC/metrics-server and submit jobs, asking for more than we have (hope it works!)

Pinging @asarkar-parsys and @milroy to stay in the loop (and no need to respond on vacation/weekend)! I think the first step, for Flux, is to test writing a plugin that can export some number of metrics that are important for scaling.

@garlick (Member Author) commented May 28, 2023

Nice 👍! I think the autoscaling discussion is orthogonal to this PR topic though (as you alluded to), so maybe it belongs in an operator discussion, not here?

I can probably make it so that -Ssize=N works in a non-PMI bootstrapped instance. Maybe that should be included here and then there will be something further to play with in our bag of tricks.

I had been sort of hesitant because we declare that "thou shalt use the same config files for all brokers within a flux instance" and if that remains true, then all brokers could already have the bootstrap information and don't need to acquire it dynamically. However that declaration can remain true if the extra brokers are left out of the configuration, if that turns out to be convenient for some reason.

It's probably better design to make the size override work in all cases anyway and not leave people wondering why it works here but not there.

@vsoch (Member) commented May 28, 2023

I can probably make it so that -Ssize=N works in a non-PMI bootstrapped instance. Maybe that should be included here and then there will be something further to play with in our bag of tricks.

That sounds perfect! And I'll test it when it's done.

Nice +1! I think the autoscaling discussion is orthogonal to this PR topic though (as you alluded to), so maybe it belongs in an operator discussion, not here?

I agree - I'm sorry, I wanted to give the full context of what I was thinking, though this wasn't the best place for it.

It's probably better design to make the size override work in all cases anyway and not leave people wondering why it works here but not there.

Also agree!

@garlick (Member Author) commented May 28, 2023

I agree - I'm sorry, I wanted to give the full context of what I was thinking, though this wasn't the best place for it.

Thanks and no problem at all, I appreciate all the effort you put into explaining!

@vsoch (Member) commented May 28, 2023

@garlick do you have a suggestion for docs or similar for how to make a flux plugin? I found https://github.com/flux-framework/flux-docs/blob/307a60d8a73dde4c6f644d0d5e62a1891444414e/tutorials/integrations/stats.rst, an "integration" that seems similar in concept (flux interacting with a service), and I found the reference to the envar here, and it looks like it's part of libflux. There also seem to be other plugins scattered inside the src directory (job validator, ingest, etc.). And if I remember correctly, they load in rc files like here? So what would be the best way to accomplish something like modload 0 prometheus, where prometheus is my plugin, and then turn it on to (maybe) do a basic load of a Flux handle in the context to expose some set of resource metrics?

I suck at C++ and it's really scary, but this might be a good opportunity to push myself into it a bit! 😆

@garlick (Member Author) commented May 28, 2023

do you have a suggestion for docs or similar for how to make a flux plugin?

It depends on what kind of plugin is needed. I would think resource utilization or queue length or something like that would be the sort of metric you'd want for autoscaling? (Could we move this to the autoscaling discussion?)

@vsoch (Member) commented May 28, 2023

Okay, moved here: #5214. Ty!

@garlick (Member Author) commented Mar 1, 2024

Rebased on current master and fixed various conflicts.

Commits:

Problem: internally generated curve certs are not named,
so overlay_cert_name() can return NULL, but a name is required
when authorizing a cert.

This API inconsistency results in extra code and confusion
when implementing a new boot method.

Use the rank as the name for internally generated certs.

Problem: there is no way to bootstrap a flux instance using PMI
with ranks (initially) missing.

Allow the 'size' broker attribute to be set on the command line.
If set to a value greater than the PMI size, perform the PMI
exchange as usual with the PMI size, but configure the overlay
topology with the additional ranks.

Since 'hostlist' is an immutable attribute that is expected to
be set by the bootstrap implementation, set it to include placeholders
("extra[0-N]") for the ranks that haven't connected yet, so we
get something other than "(null)" in the logs.

Problem: the code block that selects which boot method to use
is not very clear.

Simplify code block so that the default path is clear and
adding a boot method won't increase complexity.

Problem: there is no way to add brokers to an instance
that has extra slots available.

Add support for FLUB, the FLUx Bootstrap protocol, used when
the broker is started with
  broker.boot-server=<uri>

The bootstrap protocol consists of two RPCs:

1) overlay.flub-getinfo, which requests the allocation of an
available rank from rank 0 of the instance that is being extended,
and also retrieves the instance size and some broker attributes.

2) overlay.flub-kex, which exchanges public keys with the new
rank's TBON parent and obtains the parent's TBON URI.

Assumptions:
- all ranks have the same topology configuration

Limitations (for now):
- hostnames will be logged as extra[0-N]
- a broker rank cannot be re-allocated to a new broker
- a broker cannot replace one that failed in a regular instance
- dummy resources for the max size of the instance must be configured

Problem: the flub bootstrap method requires broker services.

Add the following services (instance owner only):

overlay.flub-getinfo
  (rank 0 only) Allocate an unused rank from rank 0 and also
  return size and misc. broker attributes to be set in the new
  broker

overlay.flub-kex
  (peer rank) Exchange public keys with the TBON parent and obtain
  its zeromq URI.

Add overlay_flub_provision() which is called by boot_pmi.c when extra
ranks are configured, making those ranks available for allocation.

Problem: there is no test coverage for broker bootstrap
with a PMI size less than the actual size.

Add some tests.

Problem: there is no test coverage for adding brokers to
a flux instance.

Add some tests.

Problem: there is no way to replace a node in a Flux instance
that goes down.

Call overlay_flub_provision () when a rank goes offline
so that the flub allocator can allocate its rank to a replacement.
Unprovision ranks when they come back online.
@garlick (Member Author) commented Aug 15, 2024

I saw this mentioned in #6213 and so rebased it on current master.

I wonder if we could get this in shape to be a merge candidate without necessarily handling the higher-level resource problems? Some next steps might be:

  • consider if there is a way to make the hostlist and size more dynamic
  • consider the scaling issue in the provisioning step (ssh to rank 0) brought up in "Interested in elastic deployments on Slurm" (#6213)
  • do something sensible when booting from a config file (see conversation above)
  • document the protocol in an RFC since it is security significant

Then we could open an issue on growing/shrinking the resource set and work that separately.


codecov bot commented Aug 15, 2024

Codecov Report

Attention: Patch coverage is 67.08333% with 79 lines in your changes missing coverage. Please review.

Project coverage is 83.27%. Comparing base (cfc8b71) to head (0ba34b1).
Report is 3 commits behind head on master.

Files with missing patch coverage:
  src/broker/overlay.c    46.59% patch coverage, 47 lines missing ⚠️
  src/broker/boot_flub.c  72.34% patch coverage, 26 lines missing ⚠️
  src/broker/boot_pmi.c   80.64% patch coverage, 6 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5184      +/-   ##
==========================================
- Coverage   83.64%   83.27%   -0.38%     
==========================================
  Files         512      522      +10     
  Lines       83220    85429    +2209     
==========================================
+ Hits        69609    71139    +1530     
- Misses      13611    14290     +679     
File coverage changes:
  src/broker/broker.c         77.29% <100.00%> (-0.60%) ⬇️
  src/broker/state_machine.c  82.22% <100.00%> (-0.91%) ⬇️
  src/broker/boot_pmi.c       68.13% <80.64%> (+0.86%) ⬆️
  src/broker/boot_flub.c      72.34% <72.34%> (ø)
  src/broker/overlay.c        80.99% <46.59%> (-3.18%) ⬇️

... and 188 files with indirect coverage changes

garlick added a commit to garlick/flux-rfc that referenced this pull request Aug 19, 2024
Problem: there is no documentation for the FLUB protocol
proposed in flux-framework/flux-core#5184

Add a new RFC.