Increase max limits (per deployment) for CPU, memory and storage #140

anilmurty · 2023-11-07T17:27:33Z

Akash deployments currently allow the following maximums (ref: https://github.com/akash-network/akash-api/blob/ea71fbd0bee740198034bf1b0261c90baea88be0/go/node/deployment/v1beta3/validation_config.go#L45):

MaxUnitCPU:     256 * 1000, // 256 CPUs
MaxUnitGPU:     100,
MaxUnitMemory:  512 * unit.Gi, // 512 Gi
MaxUnitStorage: 32 * unit.Ti,  // 32 Ti
MaxUnitCount:   50,
MaxUnitPrice:   10000000, // 10akt

MinUnitCPU:     10,
MinUnitGPU:     0,
MinUnitMemory:  unit.Mi,
MinUnitStorage: 5 * unit.Mi,
MinUnitCount:   1,

MaxGroupCount: 20,
MaxGroupUnits: 20,

MaxGroupCPU:     512 * 1000,
MaxGroupGPU:     512,
MaxGroupMemory:  1024 * unit.Gi,
MaxGroupStorage: 32 * unit.Ti,

This means the maximum number of vCPUs that a single deployment can request is 512. This is a fairly severe limitation when running AI training workloads, which can sometimes need more CPUs. Memory has a similar issue: we limit it to 1024 Gi, and AI workloads need to hold large amounts of data in memory.

These limits are Akash-specific; the underlying k8s supports higher limits. They were introduced when we launched the initial mainnet, as a safeguard against misuse. Now that we have the ability to whitelist deployment wallets per provider (to protect against misuse), I think it is safe to increase these limits.

@troian is currently researching good new limits to set and will update this issue when he has a recommendation, but the immediate need is for a customer to be able to request 1024 vCPUs and 4096GB of memory.
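To make the gap concrete, here is a minimal sketch that checks the customer request against the current group limits quoted above. It assumes CPU values are in milli-CPUs (1000 = 1 vCPU), that unit.Gi is 2^30 bytes, and that "4096GB" can be read as 4096 Gi; fitsGroupLimits is a hypothetical helper written for illustration, not a function from akash-api.

```go
package main

import "fmt"

// Values copied from the limits quoted in this issue.
// Assumptions: CPU is in milli-CPUs (1000 = 1 vCPU) and Gi is 2^30 bytes.
const (
	Gi             uint64 = 1 << 30
	MaxGroupCPU    uint64 = 512 * 1000
	MaxGroupMemory uint64 = 1024 * Gi
)

// fitsGroupLimits is a hypothetical helper (not part of akash-api).
// It reports whether the requested CPU and memory each stay within
// the current per-group maximums.
func fitsGroupLimits(cpuMilli, memBytes uint64) (cpuOK, memOK bool) {
	return cpuMilli <= MaxGroupCPU, memBytes <= MaxGroupMemory
}

func main() {
	// The immediate customer need: 1024 vCPUs and 4096 Gi of memory.
	cpuOK, memOK := fitsGroupLimits(1024*1000, 4096*Gi)
	fmt.Printf("max vCPUs per group: %d\n", MaxGroupCPU/1000)
	fmt.Printf("1024 vCPUs fits: %v, 4096 Gi fits: %v\n", cpuOK, memOK)
}
```

Under these assumptions, both the CPU and the memory request exceed the current limits, which is why the constants need to change rather than the deployment.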


anilmurty · 2023-11-07T18:42:46Z

Nov 7:

Plan to set MaxUnitCPU to 384
Plan to set MaxGroupCPU to MaxUnitCPU*MaxGroupCount (we were not doing this in the past)
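As a rough illustration of what the plan above would yield, the sketch below derives MaxGroupCPU from the other two values. It assumes the planned MaxUnitCPU of 384 is expressed in milli-CPUs like the existing constant (384 * 1000) and that MaxGroupCount stays at 20; derivedMaxGroupCPU is a hypothetical name, and the released values may differ.

```go
package main

import "fmt"

// Assumptions: MaxUnitCPU uses milli-CPU units like the existing config,
// and MaxGroupCount keeps its current value of 20.
const (
	MaxUnitCPU    uint64 = 384 * 1000
	MaxGroupCount uint64 = 20
)

// derivedMaxGroupCPU is a hypothetical helper that applies the planned
// rule MaxGroupCPU = MaxUnitCPU * MaxGroupCount.
func derivedMaxGroupCPU(maxUnitCPU, maxGroupCount uint64) uint64 {
	return maxUnitCPU * maxGroupCount
}

func main() {
	groupCPU := derivedMaxGroupCPU(MaxUnitCPU, MaxGroupCount)
	// Prints: MaxGroupCPU = 7680000 milli-CPUs (7680 vCPUs)
	fmt.Printf("MaxGroupCPU = %d milli-CPUs (%d vCPUs)\n", groupCPU, groupCPU/1000)
}
```

Deriving the group limit from the unit limit keeps the two from drifting apart the next time MaxUnitCPU is raised.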

Separately: we should also consider increasing the number of volumes that can be mounted per node (we currently support one persistent and one ephemeral). Can take this up as a separate issue.

The challenge with this is that while it doesn't require a network upgrade

@brewsterdrinkwater - we will need to get validators to upgrade ahead of the mainnet upgrade

troian · 2023-11-08T18:41:32Z

released in node v0.26.2

anilmurty assigned troian and andy108369 Nov 7, 2023

anilmurty added this to Core Product and Engineering Roadmap Nov 7, 2023

anilmurty moved this to In Progress (prioritized) in Core Product and Engineering Roadmap Nov 7, 2023

anilmurty unassigned andy108369 Nov 7, 2023

anilmurty changed the title ~~[Feature Request] Increase max limits (per deployment) for CPU, memory and storage~~ Increase max limits (per deployment) for CPU, memory and storage Nov 7, 2023

troian closed this as completed Nov 8, 2023

github-project-automation bot moved this from In Progress (prioritized) to Released (in Prod) in Core Product and Engineering Roadmap Nov 8, 2023
