Akash Deployments currently allow a max of the following (ref: https://github.com/akash-network/akash-api/blob/ea71fbd0bee740198034bf1b0261c90baea88be0/go/node/deployment/v1beta3/validation_config.go#L45):

This means the largest number of vCPUs a single deployment can request is 512. This is a fairly severe limitation for AI training workloads, which can sometimes need more CPUs. Memory has the same issue: we limit it to 1024, and AI workloads need to hold large amounts of data in memory.

These limits are Akash-specific; base Kubernetes supports higher limits. They were introduced when we launched the initial mainnet as a safeguard against misuse. Now that we have the ability to whitelist deployment wallets per provider (to protect against misuse), I think it is safe to increase them.

@troian is currently researching what good new limits would be and will update this issue when he has a recommendation, but the immediate need is for a customer to be able to request 1024 vCPUs and 4096 GB of memory.
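For context, here is a minimal Go sketch of the kind of per-group check these caps imply. The constant names, the GB memory unit, and the `validateGroup` helper are illustrative assumptions for this issue, not the actual code behind the link above.

```go
package main

import "fmt"

// Per-deployment-group caps cited in this issue (vCPUs for CPU; the memory
// unit is assumed to be GB here). Names are illustrative, not the actual
// constants in validation_config.go.
const (
	maxGroupCPU   = 512  // current per-group vCPU cap
	maxGroupMemGB = 1024 // current per-group memory cap (assumed GB)
)

// validateGroup mimics the kind of aggregate check the on-chain validation
// performs: reject a deployment group whose total request exceeds the caps.
func validateGroup(totalCPU, totalMemGB int) error {
	if totalCPU > maxGroupCPU {
		return fmt.Errorf("requested %d vCPUs exceeds group cap of %d", totalCPU, maxGroupCPU)
	}
	if totalMemGB > maxGroupMemGB {
		return fmt.Errorf("requested %d GB memory exceeds group cap of %d GB", totalMemGB, maxGroupMemGB)
	}
	return nil
}

func main() {
	// The customer request from this issue: 1024 vCPUs and 4096 GB of memory.
	// It fails against the current caps, which is the problem being raised.
	if err := validateGroup(1024, 4096); err != nil {
		fmt.Println("rejected:", err)
	}
}
```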
anilmurty changed the title from "[Feature Request] Increase max limits (per deployment) for CPU, memory and storage" to "Increase max limits (per deployment) for CPU, memory and storage" on Nov 7, 2023.
Plan to set MaxUnitCPU to 384.
Plan to set MaxGroupCPU to MaxUnitCPU * MaxGroupCount (we were not doing this in the past); see the sketch after this comment.
Separately: we should also consider increasing the number of volumes that can be mounted per node (we currently support one persistent and one ephemeral). Can take this up as a separate issue.
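A minimal sketch of the proposed MaxGroupCPU derivation, assuming a placeholder MaxGroupCount of 20 (the real chain parameter may differ):

```go
package main

import "fmt"

const (
	maxUnitCPU    = 384 // planned per-unit vCPU cap (from this comment)
	maxGroupCount = 20  // placeholder only; not the real chain parameter
)

func main() {
	// Derive the group cap from the unit cap, as proposed, instead of
	// configuring MaxGroupCPU independently as was done previously.
	maxGroupCPU := maxUnitCPU * maxGroupCount
	fmt.Printf("MaxGroupCPU = %d * %d = %d vCPUs\n", maxUnitCPU, maxGroupCount, maxGroupCPU)
}
```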
The challenge with this is that, while it doesn't require a network upgrade, @brewsterdrinkwater - we will need validators to upgrade ahead of the mainnet upgrade.