[DRAFT]: DRA in Kubevirt #331

Draft · wants to merge 8 commits into base: main

Conversation

alaypatel07

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Checklist

This checklist is not enforced, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:


@kubevirt-bot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kubevirt-bot added labels on Oct 7, 2024: do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and dco-signoff: yes (indicates the PR's author has DCO signed all their commits).
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cwilkers for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines 92 to 95
// Name of the GPU device as exposed by a device plugin
Name string `json:"name"`
// DeviceName is the name of the device provisioned by device-plugins
DeviceName string `json:"deviceName,omitempty"`
Member

What is the difference between the two?

Author
@alaypatel07 Oct 8, 2024

@alicefr This is not a new API being added; it is how the existing device-plugins work, see the example here. The idea is that if DeviceName is populated, then it's a device provisioned by device-plugins and it will be passed to pod.spec.container.requests. Alternatively, if the Claim field below is not nil, then it's a DRA device and it will be passed to pod.spec.container.claims.

We have an alternate API suggestion that will make this choice much more obvious to the user.
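
To make the distinction concrete, here is a minimal sketch of how the two fields could coexist on the GPU API type; the Claim field's exact name and shape are assumptions based on this thread, not the final proposal:

// GPU device request: exactly one of DeviceName or Claim is expected to be set.
type GPU struct {
	// Name of the GPU device as exposed by a device plugin
	Name string `json:"name"`
	// DeviceName is the name of the device provisioned by device-plugins.
	// When set, it is surfaced as pod.spec.containers[].resources.requests.
	DeviceName string `json:"deviceName,omitempty"`
	// Claim, when non-nil, marks this as a DRA device and is surfaced as
	// pod.spec.containers[].resources.claims. (hypothetical field shape)
	Claim *ClaimRequest `json:"claim,omitempty"`
}

// ClaimRequest is a hypothetical reference to an entry in spec.resourceClaims.
type ClaimRequest struct {
	Name    string `json:"name"`
	Request string `json:"request,omitempty"`
}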

Comment on lines 105 to 107
Name string `json:"name"`
// DeviceName is the name of the device provisioned by device-plugins
DeviceName string `json:"deviceName,omitempty"`
Member

Same question here

Comment on lines +519 to +520
This design also assumes that the deviceName will be provided in the ClaimParameters, which requires the DRA drivers
to have a ClaimParameters.spec.deviceName in their spec.
Member

From this description you are distinguishing name and deviceName, which also relates to my previous questions. Maybe it would be clearer to explain the difference between the two at the beginning.

1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml
instead of environment variables as in the case of device-plugins

# Alternate Designs
Member

I tend to prefer the first design since it follows the same schema as the volumes/disks and the network/interfaces, where we have a section for the pod resource and how it is mapped to the VM.

### Virt launcher changes

1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml
instead of environment variables as in the case of device-plugins
Member
@alicefr Oct 8, 2024

+1 for avoiding the env variables, I like the status!

Member

From my perspective, there is no need to store driver-internal data in the public, user-facing VMI status.
This information is only required by the virt-launcher internals; no other kubevirt components benefit from this.
I fail to understand how polluting the VMI status is superior to providing specific data to the virt-launcher's compute container directly.

Member

The same way we don't keep the PVC status in the VMI, we should keep DRA-specific data out of the VMI status.


> From my perspective, there is no need to store driver-internal data in the public, user-facing VMI status.
> This information is only required by the virt-launcher internals; no other kubevirt components benefit from this.
> I fail to understand how polluting the VMI status is superior to providing specific data to the virt-launcher's compute container directly.

This is in part to address an architectural difference between DPs and DRA:

  • DPs pre-define a static resource which is usually identified by a device model, e.g. nvidia.com/GP102GL_Tesla_P40. A VMI spec references this name directly for its device, which gives the user easily inferable information that the VMI will run on an NVIDIA Tesla P40 GPU.
  • In DRA, this information is lost: the resource claim that we provide in our spec masks all details about what kind of device that claim is allocating. A resource claim is a dynamic object; you would need to look into its spec to know what piece of hardware you're getting. For the user, this means they can no longer look at the vmi spec to immediately tell what their VMI is running on.

Given this regression in the user expectation (conscious or unconscious), we should provide at least the bare-minimum identifiable information about a device, like the hardware model.
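
Using the hypothetical GPU/ClaimRequest types sketched earlier in this thread, the contrast looks roughly like this; the field values are illustrative only:

// Device-plugin style: the hardware model is readable straight from the spec.
var dpGPU = GPU{Name: "gpu1", DeviceName: "nvidia.com/GP102GL_Tesla_P40"}

// DRA style: the spec only names a claim; the actual hardware has to be
// resolved via the ResourceClaim and ResourceSlice objects.
var draGPU = GPU{Name: "gpu1", Claim: &ClaimRequest{Name: "gpu-claim"}}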


> The same way we don't keep the PVC status in the VMI, we should keep DRA-specific data out of the VMI status.

In DRA, once a resource claim is allocated, you need to take the following steps to identify which device is allocated:

  • look at the vmi spec to see the referenced claim (template or name) and request.
  • look at the virt-launcher pod status to match the claim from the vmi spec and infer the ResourceClaim k8s object.
  • look at the ResourceClaim k8s object to find the driver and device names for each request.
  • look at the ResourceSlice object for the VMI's node, published by the corresponding driver, to learn all the relevant attributes.

A PVC requires similar hoops to find volume information, but is a lot less complex than DRA. To tackle this problem, PVC information is currently posted onto the VMI status so consumers (users & kubevirt) can easily find the info about a volume. (See https://pkg.go.dev/kubevirt.io/api/core/v1#PersistentVolumeClaimInfo)

Given that DRA ResourceClaims are modeled after PVCs and we have precedent for publishing volume information in the VMI status to make visibility/access easier, I believe we should follow the same approach for DRA devices.
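
A hedged sketch of the lookup chain described in the list above, using the v1alpha3 types; this is an illustration of those steps under the stated assumptions, not code from the proposal:

package dra

import (
	corev1 "k8s.io/api/core/v1"
	resourceapi "k8s.io/api/resource/v1alpha3"
)

// allocatedDevices walks pod.status.resourceClaimStatuses to find the
// ResourceClaim object backing each claim reference, then returns the
// request/driver/pool/device results from each claim's allocation.
func allocatedDevices(pod *corev1.Pod, claims map[string]*resourceapi.ResourceClaim) []resourceapi.DeviceRequestAllocationResult {
	var results []resourceapi.DeviceRequestAllocationResult
	for _, cs := range pod.Status.ResourceClaimStatuses {
		if cs.ResourceClaimName == nil {
			continue // no claim object was created for this entry
		}
		claim, ok := claims[*cs.ResourceClaimName]
		if !ok || claim.Status.Allocation == nil {
			continue // claim missing, or not (or no longer) allocated
		}
		results = append(results, claim.Status.Allocation.Devices.Results...)
	}
	return results
}

The last step, resolving each driver/pool/device result to its attributes, would then go through the ResourceSlice published for the node.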

Author

@vladikr the particular case that @varunrsekar mentioned in the comment was discussed in the meeting on 10/15 (recording) and documented in detail here: https://docs.google.com/document/d/1bQdLoxwSC1ILvyIb4ljSm5RZmsVpvqfWKIbN0KOBKsw/edit#heading=h.wjmanqk7qzd9

Additionally, we came across another use-case for why the information is needed in the vmi status:

Use-case: debuggability of a VMI in a failed state with a DRA device

  • K8s releases resource claims once the pod has gone to a terminal state (Failed or Succeeded). This allows expensive resources to be used by other pods.
  • In the case where the VMI has failed and the virt-launcher pod has gone to a Failed state, the record of what hardware was provisioned for this VMI is lost.

It is documented in depth here: https://docs.google.com/document/d/1bQdLoxwSC1ILvyIb4ljSm5RZmsVpvqfWKIbN0KOBKsw/edit#heading=h.3z09ita4dzuz

Please let us know if this answers the question about why extending the VMI status with this information is useful to the user.
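
For illustration, a hedged sketch of the kind of VMI status entry being argued for here, loosely mirroring PersistentVolumeClaimInfo and the YAML example later in this thread; the field names and the flattened attribute map are assumptions, not the proposal's final API:

// DeviceStatusInfo records, per requested device, enough identifying
// information to survive claim deallocation after the pod fails.
type DeviceStatusInfo struct {
	// Name matches the device name in the VMI spec (e.g. "pgpu").
	Name string `json:"name"`
	// ResourceClaimName is the ResourceClaim the device was allocated from.
	ResourceClaimName string `json:"resourceClaimName,omitempty"`
	// DeviceName is the device selected within the claim (e.g. "gpu-0").
	DeviceName string `json:"deviceName,omitempty"`
	// Attributes are copied from the driver's ResourceSlice (model, uuid,
	// pciAddress, ...); simplified here to plain strings.
	Attributes map[string]string `json:"attributes,omitempty"`
}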

@alicefr
Member

alicefr commented Oct 8, 2024

@alaypatel07 the proposal looks great.
One question: you mention in the goals section allowing external DRA drivers. Do we want a mechanism to enable specific DRA drivers, or is this unnecessary? For CSI storage classes we don't have one, but for external device plugins we do.

### Virt launcher changes

1. For devices generated using DRA, virt-launcher needs to use the vmi.status.deviceStatus to generate the domxml
instead of environment variables as in the case of device-plugins
Member

How do we populate the deviceStatus information with an external DRA driver?

Author
@alaypatel07 Oct 8, 2024

@alicefr As far as this proposal is concerned, the assumption is that drivers publish the information needed to generate the right domxml for the device. For instance, in the case of a GPU, it could be either a pciAddress (pGPU) or an mdev uuid (vGPU). The drivers need to publish this information in the ResourceSlice as attributes: https://pkg.go.dev/k8s.io/[email protected]/resource/v1alpha3#BasicDevice.

Virt-controller will find the device for it and put this attribute in the vmi status using the steps mentioned here: https://github.com/kubevirt/community/blob/30a2005d8ef72cbae0d8fb1d818861b14d8a5d88/design-proposals/dra.md#dra-api-for-reading-device-related-information

Virt-launcher will then read the attribute to generate the correct domxml.

So there could be two cases here:
- either the device has standard attributes, like pciAddress and uuid, which will be supported in-tree, or
- the device needs additional attributes, for which a sidecar will be needed to read that attribute from the vmi status and define it in the domxml
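
To make the in-tree case concrete, here is a minimal, assumed sketch (not the proposal's code) of looking up a standard attribute on a ResourceSlice device and converting a pciAddress into a libvirt hostdev element:

package dra

import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1alpha3"
)

// standardAttr looks up a string attribute such as "pciAddress" or "uuid"
// on a named device within a driver-published ResourceSlice.
func standardAttr(slice *resourceapi.ResourceSlice, device, attr string) (string, bool) {
	for _, d := range slice.Spec.Devices {
		if d.Name != device || d.Basic == nil {
			continue
		}
		if a, ok := d.Basic.Attributes[resourceapi.QualifiedName(attr)]; ok && a.StringValue != nil {
			return *a.StringValue, true
		}
	}
	return "", false
}

// hostdevXML shows, very roughly, how a pciAddress like "0000:01:00.0"
// could be turned into a libvirt hostdev element for the domxml.
func hostdevXML(pciAddress string) (string, error) {
	var domain, bus, slot, function int
	if _, err := fmt.Sscanf(pciAddress, "%x:%x:%x.%x", &domain, &bus, &slot, &function); err != nil {
		return "", fmt.Errorf("bad pci address %q: %w", pciAddress, err)
	}
	return fmt.Sprintf(
		"<hostdev mode='subsystem' type='pci' managed='yes'><source><address domain='0x%04x' bus='0x%02x' slot='0x%02x' function='0x%x'/></source></hostdev>",
		domain, bus, slot, function), nil
}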

Member

As I mentioned above, please provide more use-cases for the sidecar scenario. I've provided an alternative take.

Author

As I said in one of the comments above, I will be updating the proposal to remove the sidecar scenario due to the lack of use-cases; I have updated my previous comment to reflect this.

@iholder101
Contributor

/cc

Please add to the PR description that it's a continuation of #293.

@alaypatel07
Author

> You mention in the goals section allowing external DRA drivers. Do we want a mechanism to enable specific DRA drivers, or is this unnecessary? For CSI storage classes we don't have one, but for external device plugins we do.

Answered inline: the proposal will implement the conversion logic for standard attributes in the vmi status (pciAddress and uuid); for other attributes, an extension in the form of a sidecar would be needed.

Varun Ramachandra Sekar and others added 2 commits October 8, 2024 10:43
Signed-off-by: Varun Ramachandra Sekar <[email protected]>
Comment on lines 182 to 185
The virt-launcher will have the logic of converting a GPU device into its corresponding domxml. For use-cases that are
not handled in-tree, a sidecar container could be envisioned which will convert the information available in status to
the corresponding domxml.

Member

  1. Could you please provide a specific use case for the proposed use of a sidecar? It would also help to understand why this configuration cannot be handled directly in the virt-launcher itself.
  2. Currently, a sidecar can't access the VMI object without changes to the virt-controller. How do you plan to differentiate between DRA-related sidecars and other sidecars?
  3. I assume that the proposed sidecar would rely on KubeVirt's existing hooking mechanism. It's worth noting that this mechanism is provided on a best-effort basis and is not fully supported. It uses gRPC with a defined API, and rather than adding DRA-specific data to the VMI status, a new API call could be integrated into the existing gRPC used by virt-launcher. This way, the DRA-specific data would be handled only between virt-launcher and the sidecar, avoiding replication of the DRA data across multiple components.
  4. Building a solution based on the current hooking mechanism, or modifying the libvirt XML outside of virt-launcher, is not recommended. The libvirt XML is not a supported external API; only the KubeVirt API is. Please carefully consider how to maintain API compatibility between DRA, libvirt, and KubeVirt. If supporting a sidecar is necessary, we should develop a stable interface between the compute and DRA sidecar containers.

Author

After last week's discussion and another offline conversation with @xpivarc, it looks like we currently do not have enough use-cases for allowing users to configure these devices in a sidecar. It seems we would like to encourage building support for these devices in the core rather than allowing them to be configured by a sidecar. I will modify this proposal to change the language to reflect this.




Comment on lines 232 to 248
deviceStatus:
  gpuStatuses:
  - deviceResourceClaimStatus:
      deviceAttributes:
        driverVersion:
          version: 1.0.0
        index:
          int: 0
        model:
          string: LATEST-GPU-MODEL
        uuid:
          string: gpu-8e942949-f10b-d871-09b0-ee0657e28f90
        pciAddress:
          string: 0000:01:00.0
      deviceName: gpu-0
      resourceClaimName: virt-launcher-vmi-fedora-9bjwb-gpu-resource-claim-m4k28
    name: pgpu
Member

At the moment, I am not convinced that storing this information in the VMI status is needed.
What other KubeVirt components require this information?
Why can't CDI provide this info directly to the consumer container?

@alaypatel07
Author

@alicefr From our discussion in the meeting on 10/15 on the lifecycle of the ResourceClaim and the VM/VMI:

@varunrsekar and I looked at the nvme implementation and it seems it was developed on an older version of the DRA APIs (v1alpha2). With the newer version (k8s 1.31 and v1alpha3 of the ResourceClaim API), if the pod goes to a failed state:

  1. if the resource claim was created from a template, the resource claim will be deleted
  2. if the resource claim was created by the user, the resource claim goes from allocated back to pending, where the allocations are cleared

Here is the running demo from a 1.31 cluster:

1. check the client and server versions:
$ k version
Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.31.0
2. create the vmi with a manual resource claim:
$ k apply -f demo-with-manual-claim.yaml
namespace/gpu-test1 unchanged
resourceclaim.resource.k8s.io/virt-launcher-vmi-fedora created
Warning: metadata.finalizers: "foregroundDeleteVirtualMachine": prefer a domain-qualified finalizer name to avoid accidental conflicts with other finalizer writers
virtualmachineinstance.kubevirt.io/vmi-fedora created
3. get the status of the virt-launcher pod to check the claim it used:
$ k get pods -n gpu-test1 -oyaml | grep resourceClaimName
      resourceClaimName: virt-launcher-vmi-fedora
4. check that the pod is in Running state:
$ k get pods -n gpu-test1 -oyaml | grep phase
    phase: Running
5. check the resource claim status:
$ k get resourceclaim -n gpu-test1 -oyaml | grep -A 5 status
  status:
    allocation:
      devices:
        config:
        - opaque:
            driver: gpu.nvidia.com
6. wait for the pod to go to Failed state:
$ k get pods -n gpu-test1 -oyaml -w | grep phase
  phase: Running
  phase: Running
  phase: Failed
7. check the resource claim status again:
$ k get resourceclaim -n gpu-test1 -oyaml | grep -A 5 status
  status: {}

This is further seen in the KEP: https://github.com/kubernetes/enhancements/pull/4709/files#diff-8fa237d276346416c2aafa209f721e43c9762f59cabec234eafafe01694de3eeR1221

I wonder if the use-cases that require the lifecycle of the ResourceClaim to be decoupled from the lifecycle of the pod should be deferred to future versions of DRA, where it will be possible for the allocation to persist after the pod is deleted.
