-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
broker: enable brokers to be added to running instances #5184
base: master
Are you sure you want to change the base?
Conversation
This is groundbreaking, awesome!
Any sense of how this can be done in practice when you don't know what resources will be assigned for the flubbed in ranks? Or is this mainly meant for the case of growing an instance onto known resources? |
I guess we'll want to have some way to update the instance's R and inform the scheduler, but that's potentially a big change affecting the resource acquisition protocol. So...more thought required? I was thinking of this proposal as adding flexibility to the plumbing, and hoped we might get some collective insight into next steps while playing with it. But maybe it could be useful on homogeneous clusters where a node is a node is a node? |
Unfortunately, I think the hostname will even have to match, though potentially updating a hostname entry from |
I wonder if a potential stunt would be to have a job launch with |
If we could get the resource module to update its version of R, then we could sidestep the resource acquisition protocol for now by reloading the scheduler. Then we could tackle the issues in a couple phases. |
Ooh! I hadn't thought of that. For a DAT maybe you could start the instance on one node with no scheduler loaded to accept job submissions, then add the rest of the nodes and load the scheduler? Edit: oh except feasibility check wouldn't work. |
It would be easy to write a more generic feasibility validator plugin that rejected obviously infeasible jobs in this case if it is useful. |
Actually, in the case of a DAT the eventual resources are probably already known, so maybe this is a non-issue for that use case. |
Rebased and repushed with fixups squashed, some test improvements, and added code to allow non-leaf nodes to be flubbed, and for flubbed nodes to start in any order. So for example you could start 1 node in a kary:2 topology, then flub 1023 nodes and it should wire up! The unit tests do this with 1+7 nodes and that's as far as I've tried this scaling wise. |
Hey for the DAT case, could we remove |
It doesn't seem like this is necessary for a DAT where a known set of resources are being reserved for a future time period? You can just grab the R from the reservation or whatever and use that as the resource config maybe. The part I'm not clear on is how and where the initial instance is launched and managed? Restarting the resource module and its dependencies sounds like an interesting experiment to try to grow a job, i.e. start a job of size=4 on one node, then submit another job that is flubbed into that one. Eventually it would be nice if the R of the new job could just be merged with the R of the existing instance. |
Another update which allows a lost node from a "normal" instance to be replaced with flub (and a test). |
* is used (system instance). | ||
*/ | ||
(void)attr_get (ctx->attrs, "broker.boot-server", &uri, NULL); | ||
if (!(h = flux_open_ex (uri, 0, error))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to call uri_resolve(3)
here? Then a JOBID or other resolvable URI could be provided directly to the broker.boot-server
attribute.
Although, if the intent is to add a higher level command like flux expand OPTIONS... JOBID
, perhaps that's unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this might be helpful. I'm not sure what the next steps are for tooling so may as well add that for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A complication is that uri_resolve(3)
requires FLUX_URI to be set, and the broker unsets that. I could setenv FLUX_URI to the value of the parent-uri
attribute around the call but since the environment is global...eww? Maybe we should table this one for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, yeah, that's a bummer. It might be useful in the long term in case you target the URI of a job that hasn't started yet, but doesn't seem like a big deal for now.
Maybe we need a version of uri_resolve(3)
that takes a uri
argument and (somehow) passes that down to flux uri
if that command is invoked. Meh, probably too soon to worry about it.
Whoops, I realized I forgot that if a failed broker returns to service via some other means (like systemd restarts it) then its rank would still be provisioned for flub booting. Fixed that and forced a push. Edit: better hold off reviewing this, I'm seeing some intermittent test failures that need to be run down. |
The failure I observed in the |
5b5d65a
to
e47d35d
Compare
I woke up this morning and pushed a fix to the test, then realized I actually broke the test, and repushed. Then got coffee. Sorry for the noise. Inception builder failed with this so I restarted
|
Testing this out - is there any reason this couldn't be supported for more than one external broker uri? E.g., multiple of |
And second question - how would |
That could be added if needed. Another related thing that could be added is some retry policy/logic in case the initial connect fails. I didn't see a need(?) so I kept it simple for now.
It would not work currently. The |
Could this be why my prototype sees the other hosts as down? I'm not sure what PMI bootstrapped means but I'm not sure I've done that.What should I try to see the hosts as up? |
The main use case I had in mind was growing batch/alloc jobs. You could start a flux instance of size N with As to how it works, first the flux instance bootstraps with PMI and the extra size argument. The instance creates placeholder hostnames for the extra ranks and provisions the FLUB rank allocator. Then each new broker independently executes the following sequence
I don't know if that helps. I'm not picturing an application for this with config files but I might be just thinking about it too narrowly. However you have made me aware that at minimum, the error checking here is a bit too loosey goosey. |
Gotcha. So what I should switch to trying is bringing up one MiniCluster, and then adding ranks to it. I think the problem there (in the context of the operator) is doing that ask from within the cluster doesn't interact with the operator, so you might start a size N cluster and then ask for +M, but you'd need to also ask the operator to increase too. With our current hack that just arbitrarily allows extra nodes to join/remove up to some number, we can allow the application internally or external user to make that decision, or use a horizontal pod autoscaler to do it based on a metric. I want to talk about two things here - the best example use case for the operator in this PR and then the ideal development for Flux (not scoped to this PR so I could open another issue) that we would want. Using current FLUB in the operatorAs I mentioned above, I can refactor the example to be just scaling a single MiniCluster up/down. I don't think this gains us that much because it seems (mostly) equivalent to our current hack, which is a bit better suited to the current automation we have for autoscaling. But I think my main question is - why can't this external boot server be from another MiniCluster? This is the interesting use case for me. I have two MiniClusters, each with some pods on the same network, why can't I recruit them? This is the use case I'd like to talk through because it would mean I can combine workers from different MiniClusters (which is cool and we don't currently support). Another idea to test there would be with a simple (albeit manual) flux proxy. Autoscaling Flux based on what Flux needs.We currently have demos for autoscaling flux internally (request by the application) and externally (request by the user) and automated based on resources (horizontal pod autoscaling). However, for the last I've only tried autoscale/v1, which does it based on CPU %, and I haven't tried the autoscale/v2 that allows for custom metrics. What we could do here is define a custom metric that is something directly coming from Flux! E.g., I can imagine ideally:
I haven't tested this yet, but I think:
Doing some digging, a common way to get metrics from pods is using Prometheus, and actually it's so popular there is an adapter that already exists. So what would that mean? It would mean that we need to find a way for Flux to export promethesus metrics (maybe a plugin? There do seem to be libraries, e.g., https://github.com/jupp0r/prometheus-cpp) and then have the prometheus adapter consume that, register it as a metric in the pod horizontal autoscaler, and boum it works. Then we would have the scaling of the cluster determined not by CPU/memory or manual request by a user or application, but by the needs for resources in the queue. TLDR:
Pinging @asarkar-parsys and @milroy to stay in the loop (and no need to respond on vacation/weekend)! I think the first step, for Flux, is test writing a plugin that can export some number of metrics that are important for scaling. |
Nice 👍! I think the autoscaling discussion is orthogonal to this PR topic though (as you alluded) so maybe it belongs in an operator discussion not here? I can probably make it so that -Ssize=N works in a non-PMI bootstrapped instance. Maybe that should be included here and then there will be something further to play with in our bag of tricks. I had been sort of hesitant because we declare that "thou shalt use the same config files for all brokers within a flux instance" and if that remains true, then all brokers could already have the bootstrap information and don't need to acquire it dynamically. However that declaration can remain true if the extra brokers are left out of the configuration, if that turns out to be convenient for some reason. It's probably better design to make the size override work in all cases anyway and not leave people wondering why it works here but not there. |
That sounds perfect! And I'll test it when it's done.
I agree - I'm sorry I wanted to give the full context of what I was thinking, albeit it wasn't the best place.
Also agree! |
Thanks and no problem at all, I appreciate all the effort you put into explaining! |
@garlick do you have a suggestion for docs or similar for how to make a flux plugin? I found https://github.com/flux-framework/flux-docs/blob/307a60d8a73dde4c6f644d0d5e62a1891444414e/tutorials/integrations/stats.rst an "integration" which seems similar in concept (flux interacting with a service) and I found the reference to the envar here and it looks like it's part of libflux. There also seem to be other plugins scattered inside the src directory (job validator, ingest, etc). And if I remember they load in rc files like here? So what would be the best way to accomplish something like I suck at C++ and it's really scary but this might be a good opportunity to push myself into a bit! 😆 |
It depends on what kind of plugin is needed. I would think resource utilization or queue length or something like that would be the sort of metric you'd want for autoscaling? (Could we move this to the autoscaling discussion?) |
okay moved here: #5214 Ty! |
Rebased on current master and fixed various conflicts. |
Problem: internally generated curve certs are not named, so overlay_cert_name() can return NULL, but a name is required when authorizing a cert. This API inconsistency results in extra code and confusion when implementing a new boot method. Use the rank as the name for internally generated certs.
Problem: there is no way to bootstrap a flux instance using PMI with ranks (initially) missing. Allow the 'size' broker attribute to be set on the command line. If set to a value greater than the PMI size, perform the PMI exchange as usual with the PMI size, but configure the overlay topology with the additional ranks. Since 'hostlist' is an immutable attribute that is expected to be set by the bootstrap implementation, set it to include placeholders for the ranks that haven't connected yet "extra[0-N]" so we get something other than "(null)" in the logs.
Problem: the code block that selects which boot method to use is not very clear. Simplify code block so that the default path is clear and adding a boot method won't increase complexity.
Problem: there is no way to add brokers to an instance that has extra slots available. Add support for FLUB, the FLUx Bootstrap protocol, used when the broker is started with broker.boot-server=<uri> The bootstrap protocol consists of two RPCs: 1) overlay.flub-getinfo, which requests the allocation of an available rank from rank 0 of the instance that is being extended, and also retrieves the instance size and some broker attributes. 2) overlay.flub-kex, which exchanges public keys with the new rank's TBON parent and obtains the parent's TBON URI. Assumptions: - all ranks have the same topology configuration Limitations (for now): - hostnames will be logged as extra[0-N] - a broker rank cannot be re-allocated to a new broker - a broker cannot replace one that failed in a regular instance - dummy resources for the max size of the instance must be configured
Problem: the flub bootstrap method requires broker services. Add the following services (instance owner only): overlay.flub-getinfo (rank 0 only) Allocate an unused rank from rank 0 and also return size and misc. broker attributes to be set in the new broker overlay.flub-kex (peer rank) Exchange public keys with the TBON parent and obtain its zeromq URI. Add overlay_flub_provision() which is called by boot_pmi.c when extra ranks are configured, making those ranks available for allocation.
Problem: there is no test coverage for broker bootstrap with a PMI size less than the actual size. Add some tests.
Problem: there is no test coverage for adding brokers to a flux instance. Add some tests.
Problem: there is no way to replace a node in Flux instance that goes down. Call overlay_flub_provision () when a rank goes offline so that the flub allocator can allocate its rank to a replacement. Unprovision ranks when they return to online.
I saw this mentioned in #6213 and so rebased it on current master. I wonder if we could get this in shape to be a merge candidate without necessarily handling the higher level resource problems? some next steps might be
Then we could open an issue on growing/shrinking the resource set and work that separately. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5184 +/- ##
==========================================
- Coverage 83.64% 83.27% -0.38%
==========================================
Files 512 522 +10
Lines 83220 85429 +2209
==========================================
+ Hits 69609 71139 +1530
- Misses 13611 14290 +679
|
Problem: there is no documentation for the FLUB protocol proposed in flux-framework/flux-core#5184 Add a new RFC.
This adds the ability to add brokers to a running instance that was bootstrapped with PMI.
There are two pieces:
-Ssize=N
can be specified on broker command line when the broker is bootstrapped via PMI. The PMI exchange occurs over the size set by the PMI process manager as usual, and the additional ranks are just leftoffline
.The FLUB (FLUx Bootstrap) boot method is added. A broker can be started with
-Sbroker.boot-server=URI
with URI set to theflux-uri(1)
URI of a running instance. It will perform a bootstrap operation with the instance and join as one of the extra ranks.A big caveat is of course that instance resource configurations are static, so we can't yet combine R handed down from the enclosing instance for the PMI portion of the instance with dynamically probed R from the flubbed-in parts. Dummy resources for the maximum instance size have to be configured from the start.
However this is neat: you can start a batch job of N brokers but with an instance sizeof N+M, then later you can submit a job that requests M nodes and starts brokers that wire in to the other instance. This is demonstrated in one of the new sharness tests.
There are some other caveats documented in commit messages that could be addressed here or later. For example,
only leaf nodes in the topology can be added currently. That could be trivially fixed with the addition of more code. And it's likely a small amount of work to extend this so that non-critical brokers that die in a regular flux instance could be replaced with new ones.