Incorrect NUMA Node and CPU Pinning During VM Migration #6772

Open · 1 of 3 tasks

feldsam opened this issue Nov 4, 2024 · 1 comment

Comments


feldsam commented Nov 4, 2024

/!\ To report a security issue please follow this procedure:
[https://github.com/OpenNebula/one/wiki/Vulnerability-Management-Process]

Description
The current implementation of Huge Pages support, added by the enhancement "Support use of huge pages without CPU pinning #6185," selects a NUMA node based on free resources, and this scheduling mechanism effectively balances load across NUMA nodes. However, during VM migration the NUMA node selected on the source host is carried over unchanged, which leads to inconsistencies on the target host.

To Reproduce

  1. Configure a VM to use Huge Pages and deploy it on a host (a minimal template sketch is shown after this list).
  2. Initiate a migration using the standard SAVE/Restore or Live migration method.
  3. Observe that the VM continues to use the old NUMA node on the target host, even if the scheduler selects a different NUMA node based on the target host’s free resources.
  4. If there is insufficient memory in the old NUMA node on the target, the migration may fail.
  5. Deploy new VMs and note inconsistencies caused by incorrectly pinned VMs.
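
For reference, a minimal template sketch for step 1, assuming the Huge Pages support from #6185; the values are illustrative and HUGEPAGE_SIZE is given in MB:

```
# Illustrative values only; no PIN_POLICY is set, so the VM relies on the
# scheduler picking a NUMA node with enough free huge pages (per #6185).
MEMORY = "4096"
CPU    = "2"
VCPU   = "2"

TOPOLOGY = [
  HUGEPAGE_SIZE = "2" ]
```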

Expected behavior

  • When a VM is migrated using SAVE/Restore or Live migration methods, the NUMA node assignments should be updated based on the scheduler's decision.
  • The migration should update the VM’s configuration with the correct NUMA assignments, avoiding failures and maintaining scheduling consistency.

Details

  • Affected Component: Scheduler, Virtual Machine Manager (VMM)
  • Hypervisor: KVM
  • Version: All

Additional context

  • During SAVE and Live migration operations, we can use the --xml option to provide a new XML configuration file with the updated NUMA topology and CPU pinning information. This ensures that the VM's NUMA node and CPU assignments are correctly updated on the target host (see the sketch below).
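
For illustration only, a rough sketch of how the updated XML could be handed to libvirt; the domain name, file paths, and nodeset/cpuset values are placeholders, not the actual driver code:

```
# The updated XML would carry the placement chosen for the target host, e.g.:
#   <numatune>
#     <memory mode="strict" nodeset="1"/>
#   </numatune>
#   <cputune>
#     <vcpupin vcpu="0" cpuset="8"/>
#   </cputune>

# Live migration, supplying the updated XML for the destination:
virsh migrate --live one-42 qemu+ssh://target-host/system --xml one-42-updated.xml

# Save/restore flow: restore the saved state with the updated XML on the target:
virsh restore /path/to/checkpoint --xml one-42-updated.xml
```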

Progress Status

feldsam added a commit to FELDSAM-INC/one that referenced this issue Nov 4, 2024
…VM Save and Live Migration

Signed-off-by: Kristian Feldsam <[email protected]>
feldsam added a commit to FELDSAM-INC/one that referenced this issue Nov 4, 2024
feldsam added a commit to FELDSAM-INC/one that referenced this issue Nov 4, 2024
feldsam added a commit to FELDSAM-INC/one that referenced this issue Dec 20, 2024
feldsam added a commit to FELDSAM-INC/one that referenced this issue Dec 20, 2024
feldsam added a commit to FELDSAM-INC/one that referenced this issue Dec 20, 2024
feldsam added a commit to FELDSAM-INC/one that referenced this issue Dec 20, 2024

feldsam commented Dec 20, 2024

Hello @rsmontero @paczerny,

I just fixed my code for deleting capacity from the previous host and tested it in a lab environment. I tested all migration modes:

  • live
  • save - suspend/restore
  • poweroff
  • poweroff hard

This is related to #6596, where only save - suspend/restore was implemented.

I also tested bigger VMs that span multiple NUMA nodes, and I see another bug: only the first NUMA node's CPUs are cleared. Should I report this as a new issue?

Thanks!

feldsam added a commit to FELDSAM-INC/one that referenced this issue Dec 20, 2024
…VM Save and Live Migration

Signed-off-by: Kristian Feldsam <[email protected]>