Increase max limits (per deployment) for CPU, memory and storage #140

Closed
anilmurty opened this issue Nov 7, 2023 · 2 comments
@anilmurty

Akash deployments currently allow the following limits (ref: https://github.com/akash-network/akash-api/blob/ea71fbd0bee740198034bf1b0261c90baea88be0/go/node/deployment/v1beta3/validation_config.go#L45):

MaxUnitCPU:     256 * 1000, // 256 CPUs
MaxUnitGPU:     100,
MaxUnitMemory:  512 * unit.Gi, // 512 Gi
MaxUnitStorage: 32 * unit.Ti,  // 32 Ti
MaxUnitCount:   50,
MaxUnitPrice:   10000000, // 10akt

MinUnitCPU:     10,
MinUnitGPU:     0,
MinUnitMemory:  unit.Mi,
MinUnitStorage: 5 * unit.Mi,
MinUnitCount:   1,

MaxGroupCount: 20,
MaxGroupUnits: 20,

MaxGroupCPU:     512 * 1000,
MaxGroupGPU:     512,
MaxGroupMemory:  1024 * unit.Gi,
MaxGroupStorage: 32 * unit.Ti,

This means the maximum number of vCPUs that a single deployment can request is 512. This is a fairly severe limitation when running AI training workloads, which can sometimes need more CPUs. Memory has a similar issue: we limit it to 1024 Gi, and AI workloads need to hold large amounts of data in memory.

These limits are Akash-specific; base k8s supports higher limits. They were introduced when we launched the initial mainnet, as a safeguard against misuse. Now that we have the ability to whitelist deployment wallets per provider (to protect against misuse), I think it is safe to increase these limits.

@troian is currently researching good new limits to set and will update this issue when he has a recommendation, but the immediate need is for a customer to be able to request 1024 vCPUs and 4096 GB of memory.
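To make the gap concrete, here is a minimal Go sketch that checks a request against the group limits quoted above. It is not the akash-api validation code: `checkGroup` is a hypothetical helper, CPU is assumed to be counted in milli-CPU (1000 units per vCPU, as the `256 * 1000, // 256 CPUs` comment suggests), `unit.Gi` is assumed to be 2^30 bytes, and the requested 4096 GB is treated as 4096 Gi for illustration.

```go
package main

import "fmt"

// Values copied from the limits quoted in this issue.
// Assumption: CPU is in milli-CPU and gi stands in for unit.Gi (2^30 bytes).
const (
	gi             = uint64(1) << 30
	maxGroupCPU    = uint64(512 * 1000) // 512 vCPUs
	maxGroupMemory = 1024 * gi          // 1024 Gi
)

// checkGroup is a hypothetical helper that reports whether a group's total
// CPU and memory request fits within the quoted group limits.
func checkGroup(milliCPU, memoryBytes uint64) error {
	if milliCPU > maxGroupCPU {
		return fmt.Errorf("cpu %d exceeds group max %d (milli-CPU)", milliCPU, maxGroupCPU)
	}
	if memoryBytes > maxGroupMemory {
		return fmt.Errorf("memory %d exceeds group max %d (bytes)", memoryBytes, maxGroupMemory)
	}
	return nil
}

func main() {
	// A request exactly at the current limits passes.
	fmt.Println(checkGroup(512*1000, 1024*gi))
	// The customer request described above (1024 vCPUs, 4096 Gi) is rejected.
	fmt.Println(checkGroup(1024*1000, 4096*gi))
}
```

Under these assumptions, the customer request exceeds the current CPU limit by 2x and the memory limit by 4x.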

@anilmurty anilmurty moved this to In Progress (prioritized) in Core Product and Engineering Roadmap Nov 7, 2023
@anilmurty anilmurty changed the title [Feature Request] Increase max limits (per deployment) for CPU, memory and storage Increase max limits (per deployment) for CPU, memory and storage Nov 7, 2023
@anilmurty

anilmurty commented Nov 7, 2023

Nov 7:

Plan to set MaxUnitCPU to 384
Plan to set MaxGroupCPU to MaxUnitCPU*MaxGroupCount (we were not doing this in the past)
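As a rough sketch of what the plan above would yield, the following Go snippet derives the group CPU limit from the unit limit. `derivedGroupCPU` is a hypothetical helper, CPU is assumed to be counted in milli-CPU, and `MaxGroupCount` is assumed to stay at the 20 quoted in the original post; the values actually shipped are not stated in this thread.

```go
package main

import "fmt"

// maxGroupCount mirrors the MaxGroupCount value quoted in the original post.
const maxGroupCount = 20

// derivedGroupCPU is a hypothetical helper that applies the proposed rule
// MaxGroupCPU = MaxUnitCPU * MaxGroupCount, with CPU in milli-CPU.
func derivedGroupCPU(maxUnitMilliCPU uint64) uint64 {
	return maxUnitMilliCPU * maxGroupCount
}

func main() {
	proposed := derivedGroupCPU(384 * 1000) // planned MaxUnitCPU of 384 CPUs
	fmt.Printf("proposed MaxGroupCPU: %d milli-CPU (%d vCPUs)\n", proposed, proposed/1000)
}
```

With these inputs the derived limit works out to 7,680,000 milli-CPU (7680 vCPUs), compared with the previous hard-coded 512 * 1000.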

Separately: we should also consider increasing the number of volumes that can be mounted per node (we currently support one persistent and one ephemeral). Can take this up as a separate issue.

The challenge with this is that while it doesn't require a network upgrade

@brewsterdrinkwater - we will need to ask validators to upgrade ahead of the mainnet upgrade

@troian

troian commented Nov 8, 2023

released in node v0.26.2

@troian troian closed this as completed Nov 8, 2023
@github-project-automation github-project-automation bot moved this from In Progress (prioritized) to Released (in Prod) in Core Product and Engineering Roadmap Nov 8, 2023