Incorrect NUMA Node and CPU Pinning During VM Migration #6772

Open · 1 of 3 tasks

feldsam opened this issue Nov 4, 2024 · 1 comment

Comments


feldsam commented Nov 4, 2024

/!\ To report a security issue please follow this procedure:
[https://github.com/OpenNebula/one/wiki/Vulnerability-Management-Process]

Description
The current implementation of Huge Pages support, added by the enhancement "Support use of huge pages without CPU pinning #6185," selects a NUMA node based on free resources, and this scheduling mechanism effectively balances load across NUMA nodes. However, during VM migration the NUMA node selected on the source host is carried over unchanged, which leads to inconsistencies on the target host.

To Reproduce

  1. Configure a VM to use Huge Pages and deploy it on a host (a minimal template sketch is shown after this list).
  2. Initiate a migration using the standard SAVE/Restore or Live migration method.
  3. Observe that the VM continues to use the old NUMA node on the target host, even if the scheduler selects a different NUMA node based on the target host’s free resources.
  4. If there is insufficient memory in the old NUMA node on the target, the migration may fail.
  5. Deploy new VMs and note inconsistencies caused by incorrectly pinned VMs.
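
For reference, a minimal template sketch for step 1, assuming the Huge Pages support from #6185; the values are illustrative and HUGEPAGE_SIZE is given in MB:

```
# Illustrative values only; no PIN_POLICY is set, so the VM relies on the
# scheduler picking a NUMA node with enough free huge pages (per #6185).
MEMORY = "4096"
CPU    = "2"
VCPU   = "2"

TOPOLOGY = [
  HUGEPAGE_SIZE = "2" ]
```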

Expected behavior

  • When a VM is migrated using SAVE/Restore or Live migration methods, the NUMA node assignments should be updated based on the scheduler's decision.
  • The migration should update the VM’s configuration with the correct NUMA assignments, avoiding failures and maintaining scheduling consistency.

Details

  • Affected Component: Scheduler, Virtual Machine Manager (VMM)
  • Hypervisor: KVM
  • Version: All

Additional context

  • During SAVE and Live migration operations, we can use the --xml option to provide a new XML configuration file with the updated NUMA topology and CPU pinning information. This ensures that the VM's NUMA node and CPU assignments are correctly updated on the target host (see the sketch below).
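
For illustration only, a rough sketch of how the updated XML could be handed to libvirt; the domain name, file paths, and nodeset/cpuset values are placeholders, not the actual driver code:

```
# The updated XML would carry the placement chosen for the target host, e.g.:
#   <numatune>
#     <memory mode="strict" nodeset="1"/>
#   </numatune>
#   <cputune>
#     <vcpupin vcpu="0" cpuset="8"/>
#   </cputune>

# Live migration, supplying the updated XML for the destination:
virsh migrate --live one-42 qemu+ssh://target-host/system --xml one-42-updated.xml

# Save/restore flow: restore the saved state with the updated XML on the target:
virsh restore /path/to/checkpoint --xml one-42-updated.xml
```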

Progress Status

feldsam added a commit to FELDSAM-INC/one that referenced this issue Nov 4, 2024
…VM Save and Live Migration

Signed-off-by: Kristian Feldsam <[email protected]>
feldsam added a commit to FELDSAM-INC/one that referenced this issue Nov 4, 2024
feldsam added a commit to FELDSAM-INC/one that referenced this issue Nov 4, 2024
feldsam added a commit to FELDSAM-INC/one that referenced this issue Dec 20, 2024
feldsam added a commit to FELDSAM-INC/one that referenced this issue Dec 20, 2024
feldsam added a commit to FELDSAM-INC/one that referenced this issue Dec 20, 2024
feldsam added a commit to FELDSAM-INC/one that referenced this issue Dec 20, 2024

feldsam commented Dec 20, 2024

Hello @rsmontero @paczerny,

I just fixed my code for deleting capacity from the previous host and tested it in a lab environment. I tested all migration modes:

  • live
  • save - suspend/restore
  • poweroff
  • poweroff hard

This is related to #6596, where only save - suspend/restore was implemented.

I also tested bigger VMs that span multiple NUMA nodes, and I see another bug: only the first NUMA node's CPUs are cleared. Should I report this as a new issue?

Thanks!

feldsam added a commit to FELDSAM-INC/one that referenced this issue Dec 20, 2024
…VM Save and Live Migration

Signed-off-by: Kristian Feldsam <[email protected]>