[BUG] - Longhorn instance-manager only uses max 2 CPU #9722
Labels
kind/bug
require/backport
require/qa-review-coverage
Describe the bug
An app with rather high I/O (500-1000 IOPS) is running. Whenever a Longhorn background task runs (such as a replica rebuild or a snapshot deletion), the instance-manager pod on the affected worker node comes under heavy load and consumes 1-2 CPUs.
There are 2 observations:
The instance-manager pod appears to be the bottleneck; the node's actual disk does not (about 20% utilization).
The worker node provides 16 CPUs in total. The instance-manager uses at most 2 CPUs, other pods at most 1 CPU, and roughly 13 CPUs sit mostly idle.
Now the bug:
If the instance-manager is this busy (driving the logical sdm device to near-100% utilization and making the background tasks quite slow), why does the instance-manager pod never use more than 2 CPUs? The CPU request value (danger zone setting) is left at its default (1.92 CPUs), and no Kubernetes CPU limit is configured.
Is there a limitation within the instance-manager software that prevents it from scaling to use more CPUs?
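For context, a minimal sketch of where the 1.92 figure likely comes from, assuming the default Longhorn "Guaranteed Instance Manager CPU" setting of 12% of the node's CPUs (an assumption on my part; the exact derivation may differ by Longhorn version). Note this value is only a scheduling request, not a limit, so by itself it should not cap usage at ~2 CPUs:

```shell
# Assumption: Longhorn computes the instance-manager CPU request as a
# percentage of the node's CPUs (default setting value: 12%).
guaranteed_percent=12   # assumed default of "Guaranteed Instance Manager CPU"
node_cpus=16            # worker node size from this report

# 16 CPUs * 12% = 1.92 CPUs, matching the observed default request.
awk -v p="$guaranteed_percent" -v c="$node_cpus" \
    'BEGIN { printf "%.2f\n", c * p / 100 }'
# prints 1.92
```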
The first screenshot shows the CPU graph of the instance-manager pod. The increased CPU load starts at about 05:00, when the snapshot merge began; no more than 2 CPUs are ever used.
The second screenshot shows disk utilization over time: the logical "sdm" device is at nearly 100%, while the "physical" node device "vdc" is at only about 20%.
To Reproduce
Expected behavior
The instance-manager should use more than 2 CPUs when required, provided they are available on the worker node.
Support bundle for troubleshooting
Environment