[stable-only][cve] Check VMDK create-type against an allowed list #21

Open · wants to merge 9 commits into base: stackhpc/ussuri

Conversation

JohnGarbutt (Member)

NOTE(sbauza): Stable policy allows us to proactively merge a backport without waiting for the parent patch to be merged (exception to rule #4 in [1]). Marking [stable-only] in order to silence nova-tox-validate-backport.

[1] https://docs.openstack.org/project-team-guide/stable-branches.html#appropriate-fixes

Conflicts vs victoria in:
    nova/conf/compute.py

Related-Bug: #1996188
Change-Id: I5a399f1d3d702bfb76c067893e9c924904c8c360
(cherry picked from commit bf8b6f698c23c631c0a71d77f2716ca292c45e74)
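
For context, the shape of the fix can be sketched as a simple allow-list check on the VMDK descriptor's create-type (illustrative only, not the exact nova change; the list mirrors the upstream [compute] vmdk_allowed_types option and its documented defaults, and qemu_img_info stands in for the parsed JSON output of `qemu-img info --format=json`):

    # Illustrative sketch only -- not the exact nova code.
    ALLOWED_VMDK_TYPES = ['streamOptimized', 'monolithicSparse']

    def check_vmdk_create_type(image_id, qemu_img_info):
        """Reject a VMDK whose descriptor create-type is not allowed."""
        create_type = (qemu_img_info.get('format-specific', {})
                       .get('data', {})
                       .get('create-type'))
        if create_type not in ALLOWED_VMDK_TYPES:
            raise ValueError(
                'Image %s has unsafe VMDK create-type %r; allowed: %s'
                % (image_id, create_type, ', '.join(ALLOWED_VMDK_TYPES)))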

markgoddard and others added 9 commits August 4, 2022 15:21
Bug 1853009 describes a race condition involving multiple nova-compute
services with ironic. As the compute services start up, the hash ring
rebalances, and the compute services have an inconsistent view of which
is responsible for a compute node.

The sequence of actions here is adapted from a real world log [1], where
multiple nova-compute services were started simultaneously. In some
cases mocks are used to simulate race conditions.

There are three main issues with the behaviour:

* host2 deletes the orphan compute node after host1 has taken
  ownership of it.

* host1 assumes that another compute service will not delete its nodes.
  Once a node is in rt.compute_nodes, it is not removed again unless the
  node is orphaned. This prevents host1 from recreating the compute
  node.

* host1 assumes that another compute service will not delete its
  resource providers. Once an RP is in the provider tree, it is not
  removed.

This functional test documents the current behaviour, with the idea that
it can be updated as this behaviour is fixed.

[1] http://paste.openstack.org/show/786272/
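
The orphan-handling step this race targets looks roughly like the following (a simplified, hypothetical sketch; the names are illustrative and cn.destroy() stands in for the DB delete, not the exact nova code):

    import logging

    LOG = logging.getLogger(__name__)

    def reap_orphan_compute_nodes(compute_nodes_in_db, driver_nodenames):
        """Delete DB records for nodes this host's driver no longer reports.

        This is the racy step: between querying the DB and calling
        destroy(), another compute service may have taken ownership of
        the node via the rebalanced hash ring.
        """
        for cn in compute_nodes_in_db:
            if cn.hypervisor_hostname not in driver_nodenames:
                LOG.info("Deleting orphan compute node %s", cn.id)
                cn.destroy()  # hypothetical stand-in for the DB delete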

Co-Authored-By: Matt Riedemann <[email protected]>

Change-Id: Ice4071722de54e8d20bb8c3795be22f1995940cd
Related-Bug: #1853009
Related-Bug: #1853159
(cherry picked from commit 59d9871)
(cherry picked from commit c260e75)
(cherry picked from commit 34d5bca6bc612f41ac3fc33447cd2ba83bf446ba)
(cherry picked from commit f57c4965eef8e25656635897f46dccfe23c3fa5c)
There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The issue being addressed here is that if a compute node is deleted by a host
which thinks it is an orphan, then the compute host that actually owns the node
might not recreate it if the node is already in its resource tracker cache.

This change fixes the issue by clearing nodes from the resource tracker cache
for which a compute node entry does not exist. Then, when the available
resource for the node is updated, the compute node object is not found in the
cache and gets recreated.
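
A minimal sketch of that shape (illustrative names only; MiniResourceTracker, db_lookup, and create_fn are hypothetical stand-ins, not the nova API):

    class MiniResourceTracker:
        """Hypothetical stand-in for the resource tracker cache."""

        def __init__(self):
            self.compute_nodes = {}  # nodename -> cached compute node

        def remove_node(self, nodename):
            # Evict a node whose DB record has vanished so the next
            # update sees a cache miss.
            self.compute_nodes.pop(nodename, None)

        def update_available_resource(self, nodename, db_lookup, create_fn):
            # On a cache miss, fetch the node from the DB, or recreate
            # its record if another host deleted it.
            if nodename not in self.compute_nodes:
                cn = db_lookup(nodename) or create_fn(nodename)
                self.compute_nodes[nodename] = cn
            return self.compute_nodes[nodename]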

Conflicts:
    nova/tests/unit/compute/test_resource_tracker.py

NOTE(melwitt): The conflict is because change
I142a1f24ff2219cf308578f0236259d183785cff (Provider Config File:
Functions to merge provider configs to provider tree) is not in Ussuri.

Change-Id: I39241223b447fcc671161c370dbf16e1773b684a
Partial-Bug: #1853009
(cherry picked from commit 32676a9)
(cherry picked from commit f950ced)
(cherry picked from commit 32887f82c9bd430f285ff4418811dffdd88e398f)
(cherry picked from commit 4b5d0804749685c8afe182b716b7536045432a7e)
There is a race condition in nova-compute with the ironic virt driver
as nodes get rebalanced. It can lead to compute nodes being removed in
the DB and not repopulated. Ultimately this prevents these nodes from
being scheduled to.

The issue being addressed here is that if a compute node is deleted by a
host which thinks it is an orphan, then the resource provider for that
node might also be deleted. The compute host that owns the node might
not recreate the resource provider if it exists in the provider tree
cache.

This change fixes the issue by clearing resource providers from the
provider tree cache for which a compute node entry does not exist. Then,
when the available resource for the node is updated, the resource
providers are not found in the cache and get recreated in placement.
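
The same idea applied to the provider tree cache, as a hypothetical sketch (plain dict and set stand-ins, not nova's ProviderTree API):

    def prune_stale_providers(cached_providers, live_compute_nodes):
        """Drop cached providers with no backing compute node record.

        cached_providers: dict of provider name -> provider data
        live_compute_nodes: set of node names still present in the DB
        """
        for name in list(cached_providers):
            if name not in live_compute_nodes:
                # The next resource update misses the cache and
                # recreates the provider in placement.
                del cached_providers[name]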

Change-Id: Ia53ff43e6964963cdf295604ba0fb7171389606e
Related-Bug: #1853009
Related-Bug: #1841481
(cherry picked from commit 2bb4527)
(cherry picked from commit 0fc104e)
(cherry picked from commit 456bcc784340a5674d25fbe6e707c36c583b504f)
(cherry picked from commit 8fa5b881c9b8fb78ceca7ae02d808f1fe64d72e4)
There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The main race condition involved is in update_available_resources in
the compute manager. When the list of compute nodes is queried, there is
a compute node belonging to the host that it does not expect to be
managing, i.e. it is an orphan. Between that time and deleting the
orphan, the real owner of the compute node takes ownership of it (in
the resource tracker). However, the node is still deleted as the first
host is unaware of the ownership change.

This change prevents this from occurring by filtering on the host when
deleting a compute node. If another compute host has taken ownership of
a node, it will have updated the host field and this will prevent
deletion from occurring. The first host sees this has happened via the
ComputeHostNotFound exception, and avoids deleting its resource
provider.
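
A hypothetical sketch of the host-filtered delete (sqlite3 is used purely for illustration; the table and exception names mirror the description above, not the exact nova code):

    import sqlite3

    class ComputeHostNotFound(Exception):
        pass

    def destroy_compute_node(conn, node_id, expected_host):
        """Delete a compute node row only if this host still owns it."""
        cur = conn.execute(
            "DELETE FROM compute_nodes WHERE id = ? AND host = ?",
            (node_id, expected_host))
        if cur.rowcount == 0:
            # Another compute service updated the host column first;
            # back off and leave its resource provider alone.
            raise ComputeHostNotFound(
                "compute node %s no longer owned by %s"
                % (node_id, expected_host))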

Co-Authored-By: melanie witt <[email protected]>

Conflicts:
    nova/compute/manager.py

NOTE(melwitt): The conflict is because change
I23bb9e539d08f5c6202909054c2dd49b6c7a7a0e (Remove six.text_type (1/2))
is not in Victoria.

Closes-Bug: #1853009
Related-Bug: #1841481

Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
(cherry picked from commit a8492e8)
(cherry picked from commit cbbca58)
(cherry picked from commit 737a4ccd1398a5cf28a67547f226b668f246c784)
(cherry picked from commit 9aef6370d142a933bb8120b7a4fc71e65e3023bf)
In the fix for bug 1839560 [1][2], soft-deleted compute nodes may be
restored, to ensure we can reuse ironic node UUIDs as compute node
UUIDs. While this seems to largely work, it results in some nasty errors
being generated [3]:

    InvalidRequestError This session is in 'inactive' state, due to the
    SQL transaction being rolled back; no further SQL can be emitted
    within this transaction.

This happens because compute_node_create is decorated with
pick_context_manager_writer, which begins a transaction. While
_compute_node_get_and_update_deleted claims that calling a second
pick_context_manager_writer decorated function will begin a new
subtransaction, this does not appear to be the case.

This change removes pick_context_manager_writer from the
compute_node_create function, and adds a new _compute_node_create
function which ensures the transaction is finished if
_compute_node_get_and_update_deleted is called.
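
The restructuring can be sketched like this (hypothetical, simplified stand-ins: writer_transaction, insert_row, and the exception class are illustrative, and _compute_node_get_and_update_deleted is only referenced, not defined):

    import contextlib

    @contextlib.contextmanager
    def writer_transaction():
        # Stand-in for the decorator that opens a DB write transaction.
        yield

    class DBDuplicateEntry(Exception):
        pass

    def _compute_node_create(context, values):
        # The write transaction is scoped to this helper alone.
        with writer_transaction():
            return insert_row(context, values)  # hypothetical helper

    def compute_node_create(context, values):
        # No enclosing transaction here: by the time the duplicate is
        # handled, _compute_node_create's transaction has finished, so
        # the restore path gets a fresh session rather than an inactive
        # rolled-back one.
        try:
            return _compute_node_create(context, values)
        except DBDuplicateEntry:
            return _compute_node_get_and_update_deleted(context, values)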

The new unit test added here fails without this change.

This change marks the removal of the final FIXME from the functional
test added in [4].

[1] https://bugs.launchpad.net/nova/+bug/1839560
[2] https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
[3] http://paste.openstack.org/show/786350/
[4] https://review.opendev.org/#/c/695012/

Change-Id: Iae119ea8776bc7f2e5dbe2e502a743217beded73
Closes-Bug: #1853159
Related-Bug: #1853009
(cherry picked from commit 2383cbb)
(cherry picked from commit 665c053)
(cherry picked from commit b1331d0bed7a414591a389f5eca8700fe2fec136)
(cherry picked from commit 43e333f39cb3fb60db136205fddba31ddb97ecf2)
When the nova-compute service starts, by default it attempts to
start up instance configuration state for aspects such as networking.
This is fine in most cases, and makes a lot of sense if the
nova-compute service is just managing virtual machines on a hypervisor.

This is done, one instance at a time.

However, when the compute driver is ironic, the networking is managed
as part of the physical machine lifecycle potentially all the way into
committed switch configurations. As such, there is no need to attempt
to call ``plug_vifs`` on every single instance managed by the
nova-compute process which is backed by Ironic.

Additionally, using ironic tends to manage far more physical machines
per nova-compute service instance than when operating co-installed
with a hypervisor. Often this means a cluster of a thousand machines,
with three controllers, will see thousands of unneeded API calls upon
service start, which elongates the entire process and negatively
impacts operations.

In essence, nova.virt.ironic's plug_vifs call now does nothing,
and merely issues a debug LOG entry when called.
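
The resulting driver behaviour, as a minimal illustrative stand-in:

    import logging

    LOG = logging.getLogger(__name__)

    class IronicDriverSketch:
        """Illustrative stand-in for nova's ironic virt driver."""

        def plug_vifs(self, instance, network_info):
            # No-op: ironic attaches VIFs itself as part of the node
            # lifecycle, so there is nothing for nova-compute to do.
            LOG.debug('plug_vifs called for instance %s; no-op for the '
                      'ironic driver.', instance)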

Closes-Bug: #1777608
Change-Id: Iba87cef50238c5b02ab313f2311b826081d5b4ab
(cherry picked from commit 7f81cf2)
(cherry picked from commit eb6d70f)
(cherry picked from commit b8d8e912e0123b0f6af3a21183149c0f492330e6)
(cherry picked from commit 54a0f08e4db1a55f3be233ca63e6100769a722ad)
(cherry picked from commit b6cc159eb7a199a83a2489a06f92837e77e5e3b1)
Nova's use of libvirt's compareCPU() API served its purpose
over the years, but its design limitations break live migration in
subtle ways.  For example, the compareCPU() API compares against the
host physical CPUID.  Some of the features from this CPUID are not
exposed by KVM, and then there are some features that KVM emulates that
are not in the host CPUID.  The latter can cause bogus live migration
failures.

With QEMU >=2.9 and libvirt >= 4.4.0, libvirt will do the right thing in
terms of CPU compatibility checks on the destination host during live
migration.  Nova satisfies these minimum version requirements by a good
margin.  So, provide a workaround to skip the CPU comparison check on
the destination host before migrating a guest, and let libvirt handle it
correctly.  This workaround will be removed once Nova replaces the older
libvirt APIs with their newer and improved counterparts[1][2].
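
In practice the workaround is a simple guard, roughly as below (a simplified sketch; skip_cpu_compare_on_dest matches the released [workarounds] option name, while compare_cpu and the surrounding function are hypothetical stand-ins, not the exact nova code):

    def check_destination_cpu(conf, guest_cpu, host_cpu):
        """Destination-side CPU pre-check before live migration."""
        if conf.workarounds.skip_cpu_compare_on_dest:
            # Let libvirt (>= 4.4.0, with QEMU >= 2.9) do the CPU
            # compatibility check itself during live migration.
            return
        compare_cpu(guest_cpu, host_cpu)  # hypothetical legacy check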

                - - -

Note that Nova's libvirt driver calls compareCPU() in another method,
_check_cpu_compatibility(); I did not remove its usage yet, as it
needs more careful combing of the code, and then:

  - where possible, remove the usage of compareCPU() altogether, and
    rely on libvirt doing the right thing under the hood; or

  - where Nova _must_ do the CPU comparison checks, switch to the better
    libvirt CPU APIs -- baselineHypervisorCPU() and
    compareHypervisorCPU() -- that are described here[1].  This is work
    in progress[2].

[1] https://opendev.org/openstack/nova-specs/commit/70811da221035044e27
[2] https://review.opendev.org/q/topic:bp%252Fcpu-selection-with-hypervisor-consideration

Change-Id: I444991584118a969e9ea04d352821b07ec0ba88d
Closes-Bug: #1913716
Signed-off-by: Kashyap Chamarthy <[email protected]>
Signed-off-by: Balazs Gibizer <[email protected]>
(cherry picked from commit 267a406)
NOTE(sbauza): Stable policy allows us to proactively merge a backport without waiting for the parent patch to be merged (exception to rule #4 in [1]). Marking [stable-only] in order to silence nova-tox-validate-backport.

[1] https://docs.openstack.org/project-team-guide/stable-branches.html#appropriate-fixes

Conflicts vs victoria in:
	nova/conf/compute.py

Related-Bug: #1996188
Change-Id: I5a399f1d3d702bfb76c067893e9c924904c8c360
(cherry picked from commit bf8b6f698c23c631c0a71d77f2716ca292c45e74)
Change-Id: I6b5d0537d4279e6a1c020595056707edbc81fa9c