I've been working on this for over 30 hours this weekend and want to write down some concerns I have about #61, which is still not fully working with the new "bulk submit" model.
Resources not accounted for: Fluence can only account for node resources at init time, and doesn't account for resources consumed by pods that are scheduled by the default scheduler (primarily before fluence comes up). For example, at smaller sizes I would often see fluence assign nodes to pods, and the assignments are accepted, but then (for some reason) the work didn't wind up there. I suspect the pods were rejected by the kubelet. I'm not sure what the state is after that, because fluence thinks the pod is running there but it is not. In practice I think it leads to a stall or clog.
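To make this concrete, here is a rough sketch (not fluence's actual code; the package and function names are made up) of what accounting for pods that are already on nodes could look like: use client-go to sum the requests of non-terminal pods per node, so the resource graph could be initialized against remaining capacity instead of assuming every node is empty. It ignores init containers and pod overhead for brevity.

```go
package resourceaudit

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// UsedPerNode sums the CPU and memory requests of all non-terminal pods on
// each node, regardless of which scheduler placed them. Something like this
// could run at init (or periodically) so already-consumed resources are
// subtracted before fluence builds its view of the cluster.
func UsedPerNode(ctx context.Context, cs kubernetes.Interface) (map[string]corev1.ResourceList, error) {
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	used := map[string]corev1.ResourceList{}
	for _, pod := range pods.Items {
		if pod.Spec.NodeName == "" || pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
			continue
		}
		totals, ok := used[pod.Spec.NodeName]
		if !ok {
			totals = corev1.ResourceList{}
		}
		for _, c := range pod.Spec.Containers {
			// Indexing a missing key yields a zero Quantity, which Add handles.
			cpu := totals[corev1.ResourceCPU]
			cpu.Add(*c.Resources.Requests.Cpu())
			totals[corev1.ResourceCPU] = cpu

			mem := totals[corev1.ResourceMemory]
			mem.Add(*c.Resources.Requests.Memory())
			totals[corev1.ResourceMemory] = mem
		}
		used[pod.Spec.NodeName] = totals
	}
	return used, nil
}
```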
Reliance on state: fluence stores group job ids in an internal dictionary. This means that if the scheduler restarts (which happens) we lose the record of them. Further, when fluence comes back up it cannot reliably rebuild the mapping between the existing pod groups and node assignments. I'm not even sure how to think about this one aside from avoiding the restart case entirely - it seems like a failure case (again related to state).
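One idea, and it is only a sketch that solves half the problem (fluxion's own in-memory allocation state would also need to be rebuilt or replayed): checkpoint the group-to-jobid map somewhere that survives a restart, e.g. a ConfigMap. The namespace, ConfigMap name, and helpers below are all hypothetical, and retry-on-conflict is omitted.

```go
package statesketch

import (
	"context"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const (
	checkpointNamespace = "kube-system"         // hypothetical
	checkpointName      = "fluence-jobid-state" // hypothetical
)

// SaveJobID records the job id assigned to a pod group in the checkpoint ConfigMap.
func SaveJobID(ctx context.Context, cs kubernetes.Interface, group string, jobid int64) error {
	cms := cs.CoreV1().ConfigMaps(checkpointNamespace)
	cm, err := cms.Get(ctx, checkpointName, metav1.GetOptions{})
	if errors.IsNotFound(err) {
		cm = &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: checkpointName, Namespace: checkpointNamespace},
			Data:       map[string]string{},
		}
		if cm, err = cms.Create(ctx, cm, metav1.CreateOptions{}); err != nil {
			return err
		}
	} else if err != nil {
		return err
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data[group] = strconv.FormatInt(jobid, 10)
	_, err = cms.Update(ctx, cm, metav1.UpdateOptions{})
	return err
}

// LoadJobIDs rebuilds the in-memory map after a scheduler restart.
func LoadJobIDs(ctx context.Context, cs kubernetes.Interface) (map[string]int64, error) {
	cm, err := cs.CoreV1().ConfigMaps(checkpointNamespace).Get(ctx, checkpointName, metav1.GetOptions{})
	if errors.IsNotFound(err) {
		return map[string]int64{}, nil
	}
	if err != nil {
		return nil, err
	}
	out := make(map[string]int64, len(cm.Data))
	for group, raw := range cm.Data {
		if id, err := strconv.ParseInt(raw, 10, 64); err == nil {
			out[group] = id
		}
	}
	return out, nil
}
```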
PodGroup decoupling: For smaller or more controlled runs it works nicely: a job runs, the job completes, the reconciler watching pods sees that the number of completed / failed pods in the group equals the min members, and it deletes the PodGroup. For scaled runs (where there is likely more stress on watching pods - even a kubectl get pods can take 10 seconds) I'm worried that the PodGroup logic can get decoupled from the fluence logic, meaning that the PodGroup is cleaned up and (for some reason) we then do another AskFlux and allocate again. I'm not sure I've seen this happen, but based on the design I think it might be possible.
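What I think we want is for PodGroup cleanup to be gated on fluence's state, not just the pod counts. A rough sketch of that coupling is below; the FluxCanceller interface is a hypothetical stand-in for whatever call actually cancels the fluxion allocation for a group, and the struct is just the subset of PodGroup state the check needs.

```go
package reconcilesketch

import "context"

// FluxCanceller abstracts "cancel the fluxion allocation for this group".
type FluxCanceller interface {
	Cancel(ctx context.Context, group string) error
}

// GroupStatus is the subset of PodGroup state the reconciler needs here.
type GroupStatus struct {
	Name      string
	MinMember int32
	Succeeded int32
	Failed    int32
}

// ShouldDelete returns true only when the group is terminal AND the fluxion
// allocation has been cancelled, so the PodGroup lifecycle cannot race ahead
// of the scheduler's internal state and trigger a second AskFlux.
func ShouldDelete(ctx context.Context, g GroupStatus, flux FluxCanceller) (bool, error) {
	if g.Succeeded+g.Failed < g.MinMember {
		return false, nil
	}
	// Cancel first; only report the group as deletable once that succeeds.
	if err := flux.Cancel(ctx, g.Name); err != nil {
		return false, err
	}
	return true, nil
}
```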
Pod recreation: The main issue I'm seeing now (and don't understand) is that nodes are allocated for a group, but then, for some reason, the pods change. Fluence has already made the assignment (and removed those nodes from its list). This was helped a bit by adding the cancel back in (if AskFlux happens again), but it still clogs.
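The "adding the cancel back in" above amounts to something like the following guard (hypothetical names; the Fluxion interface stands in for the real bindings): if a group already has a job id when AskFlux runs again, cancel it before matching again so a recreated set of pods cannot double-allocate.

```go
package asksketch

import (
	"context"
	"fmt"
	"sync"
)

// Fluxion stands in for the graph scheduler bindings.
type Fluxion interface {
	Match(ctx context.Context, jobspec string) (jobid int64, nodes []string, err error)
	Cancel(ctx context.Context, jobid int64) error
}

type GroupAllocator struct {
	mu           sync.Mutex
	flux         Fluxion
	groupToJobID map[string]int64
}

func NewGroupAllocator(flux Fluxion) *GroupAllocator {
	return &GroupAllocator{flux: flux, groupToJobID: map[string]int64{}}
}

// AskFlux requests nodes for a group, cancelling any previous allocation for
// the same group so the old nodes are returned before a second match is made.
func (a *GroupAllocator) AskFlux(ctx context.Context, group, jobspec string) ([]string, error) {
	a.mu.Lock()
	defer a.mu.Unlock()

	if old, ok := a.groupToJobID[group]; ok {
		if err := a.flux.Cancel(ctx, old); err != nil {
			return nil, fmt.Errorf("cancel stale allocation for %s: %w", group, err)
		}
		delete(a.groupToJobID, group)
	}

	jobid, nodes, err := a.flux.Match(ctx, jobspec)
	if err != nil {
		return nil, err
	}
	a.groupToJobID[group] = jobid
	return nodes, nil
}
```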
Unit of operation: The scheduler works on the unit of a pod; we need to work in the unit of a group. We get around that via the PodGroup, and the MicroSecond timestamps do help, but we still have edge cases that are hard to handle, like the update / delete of a single pod, because those events act on an entire group (and the event may be erroneous for that pod, in which case we should not act at all). I don't know how to handle that right now, but I want to point out that the design is problematic.
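For reference, the way the MicroSecond timestamps help is in the queue sort: ordering by (group creation time, group name, pod creation time) keeps a group's pods adjacent even though the unit of work is still a single pod. A simplified sketch follows; the PodEntry struct is illustrative, and the real sort operates on the framework's QueuedPodInfo.

```go
package sortsketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PodEntry carries the fields the group-aware sort needs.
type PodEntry struct {
	GroupName    string
	GroupCreated metav1.MicroTime // microsecond resolution disambiguates groups created in the same second
	PodCreated   metav1.MicroTime
}

// Less orders two queued pods: earlier group first, then group name as a
// tie-break, then pod creation time within the group.
func Less(a, b PodEntry) bool {
	if !a.GroupCreated.Equal(&b.GroupCreated) {
		return a.GroupCreated.Before(&b.GroupCreated)
	}
	if a.GroupName != b.GroupName {
		return a.GroupName < b.GroupName
	}
	return a.PodCreated.Before(&b.PodCreated)
}
```

This keeps groups ordered in the queue, but it does nothing for the per-pod update / delete events mentioned above, which still have to be mapped back onto a whole group.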
On a high level, we are trying to implement a stateful model inside a framework that is largely designed against state. We are also trying to enforce the idea of a group of pods in a model where the unit is a single pod. For all of the above, I think our model works OK for small, more controlled cases, but we run into trouble for submission en masse (as I'm trying to do). My head is spinning a bit from all these design problems and I probably need to step away for a bit. Another set / sets of eyes would help too.