
[BUG] - Longhorn instance-manager only uses max 2 CPU #9722

Open
pwurbs opened this issue Oct 28, 2024 · 0 comments
Labels
kind/bug, require/backport, require/qa-review-coverage


pwurbs commented Oct 28, 2024

Describe the bug

We run an application with fairly high IO (500-1000 IOPS). Whenever a Longhorn background task runs (such as a replica rebuild or snapshot deletion), the instance-manager pod on the affected worker node is under heavy load and consumes 1-2 CPUs.
This causes two issues:

  • The background task is very slow. For example, a snapshot deletion (merging 250GB into 1.8TB) takes 15 hours, while the same snapshot deletion on another node without IO-intensive applications takes only about 1 hour.
  • The IO-intensive application suffers because of the overloaded instance-manager; IO utilization of the logical Longhorn device /dev/sdm is close to 100%.

The instance-manager pod appears to be the bottleneck; the actual node disk does not (about 20% utilization).
The worker node provides 16 CPUs overall. The instance-manager uses at most 2 CPUs, other pods use at most 1 CPU, and roughly 13 CPUs remain mostly idle.
Now the bug:
If the instance-manager is so busy (leading to near-100% utilization of the logical sdm device and very slow background tasks), why doesn't the instance-manager pod use more than 2 CPUs? The CPU request value (danger zone setting) is left at its default (1.92 CPU), and no Kubernetes limit is configured.
Is there any limitation within the instance-manager software that prevents it from scaling to more CPUs?
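
To double-check the resource configuration, the requests and limits of the instance-manager pods can be dumped with the official Python Kubernetes client. A minimal sketch, assuming a working kubeconfig, the default longhorn-system namespace, and that instance-manager pod names start with "instance-manager":

```python
# Minimal sketch: list CPU requests/limits of Longhorn instance-manager pods.
# Assumes the official "kubernetes" Python client and a working kubeconfig;
# the namespace and pod-name prefix are assumptions, adjust as needed.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("longhorn-system").items:
    if not pod.metadata.name.startswith("instance-manager"):
        continue
    for c in pod.spec.containers:
        res = c.resources
        print(
            pod.metadata.name,
            c.name,
            "requests:", res.requests,  # per the setup above: ~1.92 CPU (1920m)
            "limits:", res.limits,      # per the setup above: None (no limit)
        )
```

With the defaults described above, this should show a 1.92 CPU request (12% of the node's 16 CPUs) and no limit for each instance-manager container.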

Screenshot: CPU graph of the instance-manager pod. Increased CPU load starts at about 05:00, when the snapshot merge started; no more than 2 CPUs are used.
[screenshot omitted]

Screenshot: disk utilization over time. The logical "sdm" device is at nearly 100%, the "physical" node device "vdc" at about 20%.
[screenshot omitted]

To Reproduce

  • Deploy an IO-intensive application
  • Start a Longhorn background task (e.g. a snapshot deletion)
  • Watch the instance-manager pod metrics (see the sketch after this list)
  • Watch the slow progress of the background task
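
For the third step, a minimal polling sketch against the metrics.k8s.io API (requires metrics-server), again using the Python Kubernetes client; the namespace and pod-name prefix are the same assumptions as above:

```python
# Minimal sketch: poll CPU usage of Longhorn instance-manager pods every 30s.
# Requires metrics-server (metrics.k8s.io) and the "kubernetes" Python client;
# the namespace and pod-name prefix are assumptions, adjust as needed.
import time
from kubernetes import client, config

config.load_kube_config()
metrics = client.CustomObjectsApi()

while True:
    pod_metrics = metrics.list_namespaced_custom_object(
        group="metrics.k8s.io", version="v1beta1",
        namespace="longhorn-system", plural="pods",
    )
    for item in pod_metrics["items"]:
        name = item["metadata"]["name"]
        if not name.startswith("instance-manager"):
            continue
        # usage["cpu"] is typically reported in nanocores ("...n")
        for c in item["containers"]:
            print(name, c["name"], "cpu:", c["usage"]["cpu"])
    time.sleep(30)
```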

Expected behavior

The instance-manager should use more than 2 CPUs when needed, as long as spare CPUs are available on the worker node.

Support bundle for troubleshooting

Environment

  • Longhorn version: 1.5.5
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 6
  • Node config
    • OS type and version: AlmaLinux 8.10 (Cerulean Leopard)
    • Kernel version: 4.18.0-553.22.1.el8_10.x86_64
    • CPU per node: 16
    • Memory per node: 64GB
    • Disk type (e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes (Gbps): 10Gbps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Cloud
  • Number of Longhorn volumes in the cluster: 49
pwurbs added the kind/bug, require/backport, require/qa-review-coverage labels Oct 28, 2024