[stable-only][cve] Check VMDK create-type against an allowed list #21

Open · wants to merge 9 commits into base: stackhpc/ussuri

Conversation

JohnGarbutt (Member)

NOTE(sbauza): Stable policy allows us to proactively merge a backport without waiting for the parent patch to be merged (exception to rule #4 in [1]). Marking [stable-only] in order to silence nova-tox-validate-backport.

[1] https://docs.openstack.org/project-team-guide/stable-branches.html#appropriate-fixes

Conflicts vs victoria in:
    nova/conf/compute.py

Related-Bug: #1996188
Change-Id: I5a399f1d3d702bfb76c067893e9c924904c8c360
(cherry picked from commit bf8b6f698c23c631c0a71d77f2716ca292c45e74)
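
For context, the shape of the fix can be sketched as a simple allow-list check on the VMDK descriptor's create-type (illustrative only, not the exact nova change; the list mirrors the upstream [compute] vmdk_allowed_types option and its documented defaults, and qemu_img_info stands in for the parsed JSON output of `qemu-img info --format=json`):

    # Illustrative sketch only -- not the exact nova code.
    ALLOWED_VMDK_TYPES = ['streamOptimized', 'monolithicSparse']

    def check_vmdk_create_type(image_id, qemu_img_info):
        """Reject a VMDK whose descriptor create-type is not allowed."""
        create_type = (qemu_img_info.get('format-specific', {})
                       .get('data', {})
                       .get('create-type'))
        if create_type not in ALLOWED_VMDK_TYPES:
            raise ValueError(
                'Image %s has unsafe VMDK create-type %r; allowed: %s'
                % (image_id, create_type, ', '.join(ALLOWED_VMDK_TYPES)))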

markgoddard and others added 9 commits August 4, 2022 15:21
Bug 1853009 describes a race condition involving multiple nova-compute
services with ironic. As the compute services start up, the hash ring
rebalances, and the compute services have an inconsistent view of which
is responsible for a compute node.

The sequence of actions here is adapted from a real world log [1], where
multiple nova-compute services were started simultaneously. In some
cases mocks are used to simulate race conditions.

There are three main issues with the behaviour:

* host2 deletes the orphan compute node after host1 has taken
  ownership of it.

* host1 assumes that another compute service will not delete its nodes.
  Once a node is in rt.compute_nodes, it is not removed again unless the
  node is orphaned. This prevents host1 from recreating the compute
  node.

* host1 assumes that another compute service will not delete its
  resource providers. Once an RP is in the provider tree, it is not
  removed.

This functional test documents the current behaviour, with the idea that
it can be updated as this behaviour is fixed.

[1] http://paste.openstack.org/show/786272/
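
The orphan-handling step this race targets looks roughly like the following (a simplified, hypothetical sketch; the names are illustrative and cn.destroy() stands in for the DB delete, not the exact nova code):

    import logging

    LOG = logging.getLogger(__name__)

    def reap_orphan_compute_nodes(compute_nodes_in_db, driver_nodenames):
        """Delete DB records for nodes this host's driver no longer reports.

        This is the racy step: between querying the DB and calling
        destroy(), another compute service may have taken ownership of
        the node via the rebalanced hash ring.
        """
        for cn in compute_nodes_in_db:
            if cn.hypervisor_hostname not in driver_nodenames:
                LOG.info("Deleting orphan compute node %s", cn.id)
                cn.destroy()  # hypothetical stand-in for the DB delete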

Co-Authored-By: Matt Riedemann <[email protected]>

Change-Id: Ice4071722de54e8d20bb8c3795be22f1995940cd
Related-Bug: #1853009
Related-Bug: #1853159
(cherry picked from commit 59d9871)
(cherry picked from commit c260e75)
(cherry picked from commit 34d5bca6bc612f41ac3fc33447cd2ba83bf446ba)
(cherry picked from commit f57c4965eef8e25656635897f46dccfe23c3fa5c)
There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The issue being addressed here is that if a compute node is deleted by a host
which thinks it is an orphan, then the compute host that actually owns the node
might not recreate it if the node is already in its resource tracker cache.

This change fixes the issue by clearing nodes from the resource tracker cache
for which a compute node entry does not exist. Then, when the available
resource for the node is updated, the compute node object is not found in the
cache and gets recreated.
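
A minimal sketch of that shape (illustrative names only; MiniResourceTracker, db_lookup, and create_fn are hypothetical stand-ins, not the nova API):

    class MiniResourceTracker:
        """Hypothetical stand-in for the resource tracker cache."""

        def __init__(self):
            self.compute_nodes = {}  # nodename -> cached compute node

        def remove_node(self, nodename):
            # Evict a node whose DB record has vanished so the next
            # update sees a cache miss.
            self.compute_nodes.pop(nodename, None)

        def update_available_resource(self, nodename, db_lookup, create_fn):
            # On a cache miss, fetch the node from the DB, or recreate
            # its record if another host deleted it.
            if nodename not in self.compute_nodes:
                cn = db_lookup(nodename) or create_fn(nodename)
                self.compute_nodes[nodename] = cn
            return self.compute_nodes[nodename]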

Conflicts:
    nova/tests/unit/compute/test_resource_tracker.py

NOTE(melwitt): The conflict is because change
I142a1f24ff2219cf308578f0236259d183785cff (Provider Config File:
Functions to merge provider configs to provider tree) is not in Ussuri.

Change-Id: I39241223b447fcc671161c370dbf16e1773b684a
Partial-Bug: #1853009
(cherry picked from commit 32676a9)
(cherry picked from commit f950ced)
(cherry picked from commit 32887f82c9bd430f285ff4418811dffdd88e398f)
(cherry picked from commit 4b5d0804749685c8afe182b716b7536045432a7e)
There is a race condition in nova-compute with the ironic virt driver
as nodes get rebalanced. It can lead to compute nodes being removed in
the DB and not repopulated. Ultimately this prevents these nodes from
being scheduled to.

The issue being addressed here is that if a compute node is deleted by a
host which thinks it is an orphan, then the resource provider for that
node might also be deleted. The compute host that owns the node might
not recreate the resource provider if it exists in the provider tree
cache.

This change fixes the issue by clearing resource providers from the
provider tree cache for which a compute node entry does not exist. Then,
when the available resource for the node is updated, the resource
providers are not found in the cache and get recreated in placement.
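
The same idea applied to the provider tree cache, as a hypothetical sketch (plain dict and set stand-ins, not nova's ProviderTree API):

    def prune_stale_providers(cached_providers, live_compute_nodes):
        """Drop cached providers with no backing compute node record.

        cached_providers: dict of provider name -> provider data
        live_compute_nodes: set of node names still present in the DB
        """
        for name in list(cached_providers):
            if name not in live_compute_nodes:
                # The next resource update misses the cache and
                # recreates the provider in placement.
                del cached_providers[name]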

Change-Id: Ia53ff43e6964963cdf295604ba0fb7171389606e
Related-Bug: #1853009
Related-Bug: #1841481
(cherry picked from commit 2bb4527)
(cherry picked from commit 0fc104e)
(cherry picked from commit 456bcc784340a5674d25fbe6e707c36c583b504f)
(cherry picked from commit 8fa5b881c9b8fb78ceca7ae02d808f1fe64d72e4)
There is a race condition in nova-compute with the ironic virt driver as
nodes get rebalanced. It can lead to compute nodes being removed in the
DB and not repopulated. Ultimately this prevents these nodes from being
scheduled to.

The main race condition involved is in update_available_resources in
the compute manager. When the list of compute nodes is queried, there is
a compute node belonging to the host that it does not expect to be
managing, i.e. it is an orphan. Between that time and deleting the
orphan, the real owner of the compute node takes ownership of it (in
the resource tracker). However, the node is still deleted as the first
host is unaware of the ownership change.

This change prevents this from occurring by filtering on the host when
deleting a compute node. If another compute host has taken ownership of
a node, it will have updated the host field and this will prevent
deletion from occurring. The first host sees this has happened via the
ComputeHostNotFound exception, and avoids deleting its resource
provider.
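
A hypothetical sketch of the host-filtered delete (sqlite3 is used purely for illustration; the table and exception names mirror the description above, not the exact nova code):

    import sqlite3

    class ComputeHostNotFound(Exception):
        pass

    def destroy_compute_node(conn, node_id, expected_host):
        """Delete a compute node row only if this host still owns it."""
        cur = conn.execute(
            "DELETE FROM compute_nodes WHERE id = ? AND host = ?",
            (node_id, expected_host))
        if cur.rowcount == 0:
            # Another compute service updated the host column first;
            # back off and leave its resource provider alone.
            raise ComputeHostNotFound(
                "compute node %s no longer owned by %s"
                % (node_id, expected_host))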

Co-Authored-By: melanie witt <[email protected]>

Conflicts:
    nova/compute/manager.py

NOTE(melwitt): The conflict is because change
I23bb9e539d08f5c6202909054c2dd49b6c7a7a0e (Remove six.text_type (1/2))
is not in Victoria.

Closes-Bug: #1853009
Related-Bug: #1841481

Change-Id: I260c1fded79a85d4899e94df4d9036a1ee437f02
(cherry picked from commit a8492e8)
(cherry picked from commit cbbca58)
(cherry picked from commit 737a4ccd1398a5cf28a67547f226b668f246c784)
(cherry picked from commit 9aef6370d142a933bb8120b7a4fc71e65e3023bf)
In the fix for bug 1839560 [1][2], soft-deleted compute nodes may be
restored, to ensure we can reuse ironic node UUIDs as compute node
UUIDs. While this seems to largely work, it results in some nasty errors
being generated [3]:

    InvalidRequestError This session is in 'inactive' state, due to the
    SQL transaction being rolled back; no further SQL can be emitted
    within this transaction.

This happens because compute_node_create is decorated with
pick_context_manager_writer, which begins a transaction. While
_compute_node_get_and_update_deleted claims that calling a second
pick_context_manager_writer decorated function will begin a new
subtransaction, this does not appear to be the case.

This change removes pick_context_manager_writer from the
compute_node_create function, and adds a new _compute_node_create
function which ensures the transaction is finished if
_compute_node_get_and_update_deleted is called.
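
The restructuring can be sketched like this (hypothetical, simplified stand-ins: writer_transaction, insert_row, and the exception class are illustrative, and _compute_node_get_and_update_deleted is only referenced, not defined):

    import contextlib

    @contextlib.contextmanager
    def writer_transaction():
        # Stand-in for the decorator that opens a DB write transaction.
        yield

    class DBDuplicateEntry(Exception):
        pass

    def _compute_node_create(context, values):
        # The write transaction is scoped to this helper alone.
        with writer_transaction():
            return insert_row(context, values)  # hypothetical helper

    def compute_node_create(context, values):
        # No enclosing transaction here: by the time the duplicate is
        # handled, _compute_node_create's transaction has finished, so
        # the restore path gets a fresh session rather than an inactive
        # rolled-back one.
        try:
            return _compute_node_create(context, values)
        except DBDuplicateEntry:
            return _compute_node_get_and_update_deleted(context, values)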

The new unit test added here fails without this change.

This change marks the removal of the final FIXME from the functional
test added in [4].

[1] https://bugs.launchpad.net/nova/+bug/1839560
[2] https://git.openstack.org/cgit/openstack/nova/commit/?id=89dd74ac7f1028daadf86cb18948e27fe9d1d411
[3] http://paste.openstack.org/show/786350/
[4] https://review.opendev.org/#/c/695012/

Change-Id: Iae119ea8776bc7f2e5dbe2e502a743217beded73
Closes-Bug: #1853159
Related-Bug: #1853009
(cherry picked from commit 2383cbb)
(cherry picked from commit 665c053)
(cherry picked from commit b1331d0bed7a414591a389f5eca8700fe2fec136)
(cherry picked from commit 43e333f39cb3fb60db136205fddba31ddb97ecf2)
When the nova-compute service starts, by default it attempts to
start up instance configuration state for aspects such as networking.
This is fine in most cases, and makes a lot of sense if the
nova-compute service is just managing virtual machines on a hypervisor.

This is done, one instance at a time.

However, when the compute driver is ironic, the networking is managed
as part of the physical machine lifecycle potentially all the way into
committed switch configurations. As such, there is no need to attempt
to call ``plug_vifs`` on every single instance managed by the
nova-compute process which is backed by Ironic.

Additionally, using ironic tends to manage far more physical machines
per nova-compute service instance than when operating co-installed
with a hypervisor. Often this means a cluster of a thousand machines,
with three controllers, will see thousands of unneeded API calls upon
service start, which elongates the entire process and negatively
impacts operations.

In essence, nova.virt.ironic's plug_vifs call now does nothing,
and merely issues a debug LOG entry when called.
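
The resulting driver behaviour, as a minimal illustrative stand-in:

    import logging

    LOG = logging.getLogger(__name__)

    class IronicDriverSketch:
        """Illustrative stand-in for nova's ironic virt driver."""

        def plug_vifs(self, instance, network_info):
            # No-op: ironic attaches VIFs itself as part of the node
            # lifecycle, so there is nothing for nova-compute to do.
            LOG.debug('plug_vifs called for instance %s; no-op for the '
                      'ironic driver.', instance)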

Closes-Bug: #1777608
Change-Id: Iba87cef50238c5b02ab313f2311b826081d5b4ab
(cherry picked from commit 7f81cf2)
(cherry picked from commit eb6d70f)
(cherry picked from commit b8d8e912e0123b0f6af3a21183149c0f492330e6)
(cherry picked from commit 54a0f08e4db1a55f3be233ca63e6100769a722ad)
(cherry picked from commit b6cc159eb7a199a83a2489a06f92837e77e5e3b1)
Nova's use of libvirt's compareCPU() API served its purpose
over the years, but its design limitations break live migration in
subtle ways.  For example, the compareCPU() API compares against the
host physical CPUID.  Some of the features from this CPUID are not
exposed by KVM, and then there are some features that KVM emulates that
are not in the host CPUID.  The latter can cause bogus live migration
failures.

With QEMU >=2.9 and libvirt >= 4.4.0, libvirt will do the right thing in
terms of CPU compatibility checks on the destination host during live
migration.  Nova satisfies these minimum version requirements by a good
margin.  So, provide a workaround to skip the CPU comparison check on
the destination host before migrating a guest, and let libvirt handle it
correctly.  This workaround will be removed once Nova replaces the older
libvirt APIs with their newer and improved counterparts[1][2].
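
In practice the workaround is a simple guard, roughly as below (a simplified sketch; skip_cpu_compare_on_dest matches the released [workarounds] option name, while compare_cpu and the surrounding function are hypothetical stand-ins, not the exact nova code):

    def check_destination_cpu(conf, guest_cpu, host_cpu):
        """Destination-side CPU pre-check before live migration."""
        if conf.workarounds.skip_cpu_compare_on_dest:
            # Let libvirt (>= 4.4.0, with QEMU >= 2.9) do the CPU
            # compatibility check itself during live migration.
            return
        compare_cpu(guest_cpu, host_cpu)  # hypothetical legacy check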

                - - -

Note that Nova's libvirt driver calls compareCPU() in another method,
_check_cpu_compatibility(); I did not remove its usage yet, as it
needs more careful combing of the code, and then:

  - where possible, remove the usage of compareCPU() altogether, and
    rely on libvirt doing the right thing under the hood; or

  - where Nova _must_ do the CPU comparison checks, switch to the better
    libvirt CPU APIs -- baselineHypervisorCPU() and
    compareHypervisorCPU() -- that are described here[1].  This is work
    in progress[2].

[1] https://opendev.org/openstack/nova-specs/commit/70811da221035044e27
[2] https://review.opendev.org/q/topic:bp%252Fcpu-selection-with-hypervisor-consideration

Change-Id: I444991584118a969e9ea04d352821b07ec0ba88d
Closes-Bug: #1913716
Signed-off-by: Kashyap Chamarthy <[email protected]>
Signed-off-by: Balazs Gibizer <[email protected]>
(cherry picked from commit 267a406)
NOTE(sbauza): Stable policy allows us to proactively merge a backport without waiting for the parent patch to be merged (exception to rule #4 in [1]). Marking [stable-only] in order to silence nova-tox-validate-backport.

[1] https://docs.openstack.org/project-team-guide/stable-branches.html#appropriate-fixes

Conflicts vs victoria in:
	nova/conf/compute.py

Related-Bug: #1996188
Change-Id: I5a399f1d3d702bfb76c067893e9c924904c8c360
(cherry picked from commit bf8b6f698c23c631c0a71d77f2716ca292c45e74)
Change-Id: I6b5d0537d4279e6a1c020595056707edbc81fa9c