Multiple pods per job #2910
Labels
component/scheduling
Armada Server, Scheduler and Scheduler Injester
question
Further information is requested
type/design
Design / Architecture suggestions
Is your feature request related to a problem? Please describe.
I was recently testing Armada job services and had a question. The API currently rejects a JobSubmitRequestItem when more than one pod is specified, with the rationale presented in the code here.
This seems to imply that, at present, pods must run as separate jobs and service discovery must be coordinated across jobs. That adds a complication: services are named
armada-<job-id>-<pod-index>-<service-type>
but the job id is not known at submission time. There are several ways to work around this and find the job id at runtime when the pods are all running under separate job ids. Given the choice, though, I would prefer to run multiple pod specs within the same job, so that service discovery can be done the way the APIs seem to suggest.

The need to do out-of-band k8s lookups to find other "gang members" is not ideal because it 1) adds some load to the k8s API of the executor cluster where the gang is running, and 2) adds a small amount of complexity to the gang application code that performs the lookup. All of this is just to find the job id, form a service name from it, and then perform a DNS lookup.

If multiple pods could be submitted within the same JobSubmitRequestItem, the out-of-band k8s lookup wouldn't be necessary: each pod could use its own job id for service discovery and skip directly to the DNS lookup. There would be no extra load or code complexity from the out-of-band lookup.
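To make the desired flow concrete, here is a minimal sketch of what each pod would do if all pods shared one job id. It only assumes the naming pattern quoted above; the JOB_ID environment variable, the "headless" service type string, and the port are placeholders I made up for illustration, not confirmed Armada behavior.

```python
import os
import socket


def service_name(job_id: str, pod_index: int, service_type: str) -> str:
    """Build a name following armada-<job-id>-<pod-index>-<service-type>."""
    return f"armada-{job_id}-{pod_index}-{service_type}"


def resolve_peer(job_id: str, pod_index: int, service_type: str, port: int) -> list[str]:
    """Resolve a peer pod's service to its IP addresses with a plain DNS lookup."""
    name = service_name(job_id, pod_index, service_type)
    infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # JOB_ID is a hypothetical variable name; how a pod learns its own job id
    # is an assumption here.
    job_id = os.environ.get("JOB_ID", "example-job-id")
    # With several pods in one job, every peer differs only by pod index,
    # so no lookup against the k8s API is needed before the DNS query.
    print(service_name(job_id, 1, "headless"))
```

With separate jobs per pod, the job_id argument for a peer is exactly the value that has to be fetched out of band today.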
Describe the solution you'd like
The solution I would like is to submit multiple pod specs in a single JobSubmitRequestItem. Are there any roadmap plans for supporting multiple podSpecs per job? Does the code comment I linked above reflect a concern about ingress setup when multiple pods are part of the same job?
Describe alternatives you've considered
Another alternative I have considered is to write my own service controller that will watch pods and create services which use gang id instead of job id, since every member of the gang knows this value.
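As a rough sketch of that alternative, the controller's core step would be to build a headless Service whose name depends only on the gang id. Everything below is hypothetical: the "gang-<gang-id>-<member-index>" naming scheme, the example.com/... pod labels used as the selector, and the idea that each gang member carries a member index. Only the manifest is built here; watching pods and applying it to the cluster is left out.

```python
def gang_service_manifest(gang_id: str, member_index: int, port: int) -> dict:
    """Build a headless Service manifest for one gang member.

    The name is derived from the gang id, which every gang member already
    knows, instead of the job id.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"gang-{gang_id}-{member_index}"},  # hypothetical scheme
        "spec": {
            "clusterIP": "None",  # headless: DNS resolves straight to the pod IP
            "selector": {
                # Hypothetical labels the pods would need to carry.
                "example.com/gang-id": gang_id,
                "example.com/gang-member-index": str(member_index),
            },
            "ports": [{"port": port, "targetPort": port}],
        },
    }


if __name__ == "__main__":
    manifest = gang_service_manifest("my-gang", 0, 8080)
    print(manifest["metadata"]["name"])
```

The downside is that this is one more component to run and maintain per executor cluster, which is why I would prefer the multi-pod approach.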