
[BUG] - Longhorn instance-manager only uses max 2 CPU #9722

Open
pwurbs opened this issue Oct 28, 2024 · 0 comments
Labels
kind/bug, require/backport, require/qa-review-coverage


pwurbs commented Oct 28, 2024

Describe the bug

We run an application with fairly high IO (500-1000 IOPS). Whenever a Longhorn background task runs (such as a replica rebuild or snapshot deletion), the instance-manager pod on the affected worker node is under heavy load and consumes 1-2 CPUs.
This causes two issues:

  • The background task is very slow. For example, a snapshot deletion (merging 250GB into 1.8TB) takes 15 hours, while the same snapshot deletion on another node without IO-intensive applications takes only about 1 hour.
  • The IO-intensive application suffers because of the overloaded instance-manager; IO utilization of the logical Longhorn device /dev/sdm is close to 100%.

The instance-manager pod appears to be the bottleneck; the actual node disk does not (about 20% utilization).
The worker node provides 16 CPUs overall. The instance-manager uses at most 2 CPUs, other pods use at most 1 CPU, and roughly 13 CPUs remain mostly idle.
Now the bug:
If the instance-manager is so busy (leading to near-100% utilization of the logical sdm device and very slow background tasks), why doesn't the instance-manager pod use more than 2 CPUs? The CPU request value (danger zone setting) is left at its default (1.92 CPU), and no Kubernetes limit is configured.
Is there any limitation within the instance-manager software that prevents it from scaling to more CPUs?
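
To double-check the resource configuration, the requests and limits of the instance-manager pods can be dumped with the official Python Kubernetes client. A minimal sketch, assuming a working kubeconfig, the default longhorn-system namespace, and that instance-manager pod names start with "instance-manager":

```python
# Minimal sketch: list CPU requests/limits of Longhorn instance-manager pods.
# Assumes the official "kubernetes" Python client and a working kubeconfig;
# the namespace and pod-name prefix are assumptions, adjust as needed.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("longhorn-system").items:
    if not pod.metadata.name.startswith("instance-manager"):
        continue
    for c in pod.spec.containers:
        res = c.resources
        print(
            pod.metadata.name,
            c.name,
            "requests:", res.requests,  # per the setup above: ~1.92 CPU (1920m)
            "limits:", res.limits,      # per the setup above: None (no limit)
        )
```

With the defaults described above, this should show a 1.92 CPU request (12% of the node's 16 CPUs) and no limit for each instance-manager container.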

Screenshot: CPU graph of the instance-manager pod. Increased CPU load starts at about 05:00, when the snapshot merge started; no more than 2 CPUs are used.
[screenshot omitted]

Screenshot: disk utilization over time. The logical "sdm" device is at nearly 100%, the "physical" node device "vdc" at about 20%.
[screenshot omitted]

To Reproduce

  • Deploy an IO-intensive application
  • Start a Longhorn background task (e.g. a snapshot deletion)
  • Watch the instance-manager pod metrics (see the sketch after this list)
  • Watch the slow progress of the background task
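
For the third step, a minimal polling sketch against the metrics.k8s.io API (requires metrics-server), again using the Python Kubernetes client; the namespace and pod-name prefix are the same assumptions as above:

```python
# Minimal sketch: poll CPU usage of Longhorn instance-manager pods every 30s.
# Requires metrics-server (metrics.k8s.io) and the "kubernetes" Python client;
# the namespace and pod-name prefix are assumptions, adjust as needed.
import time
from kubernetes import client, config

config.load_kube_config()
metrics = client.CustomObjectsApi()

while True:
    pod_metrics = metrics.list_namespaced_custom_object(
        group="metrics.k8s.io", version="v1beta1",
        namespace="longhorn-system", plural="pods",
    )
    for item in pod_metrics["items"]:
        name = item["metadata"]["name"]
        if not name.startswith("instance-manager"):
            continue
        # usage["cpu"] is typically reported in nanocores ("...n")
        for c in item["containers"]:
            print(name, c["name"], "cpu:", c["usage"]["cpu"])
    time.sleep(30)
```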

Expected behavior

The instance-manager should use more than 2 CPUs when needed, as long as spare CPUs are available on the worker node.

Support bundle for troubleshooting

Environment

  • Longhorn version: 1.5.5
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 6
  • Node config
    • OS type and version: AlmaLinux 8.10 (Cerulean Leopard)
    • Kernel version: 4.18.0-553.22.1.el8_10.x86_64
    • CPU per node: 16
    • Memory per node: 64GB
    • Disk type (e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes (Gbps): 10Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Cloud
  • Number of Longhorn volumes in the cluster: 49
pwurbs added the kind/bug, require/backport, require/qa-review-coverage labels Oct 28, 2024