Challenges/Observations of loading/unloading out of tree RHEL 9.2 based GPU drivers on RHEL 9.2 based OCP 4.13 #69

Open
hershpa opened this issue Oct 31, 2023 · 1 comment

Comments

@hershpa
Contributor

hershpa commented Oct 31, 2023

Summary

New RHEL 9.2 based GPU drivers are available to provision Intel GPU Flex and Max Series. Good news: the new drivers are no longer incompatible with the ast driver. On RHEL 8.6 based OCP 4.12, the ast driver had to be unloaded or blacklisted (via a machine config, which triggers a reboot) before loading the out of tree GPU drivers.

Challenges:

The in-tree i915 and intel_vsec drivers have to be unloaded before the out of tree drivers are loaded. KMM can currently unload only one in-tree driver, but we now have a use case for unloading more than one. Potential short-term solution: unload intel_vsec outside of KMM, most likely using a machine config.

Once the out of tree drivers are loaded, they are difficult to unload because they are always in use, apparently by a GUI subcomponent such as the framebuffer. The exact root cause has not been determined, but some component in the system actively uses the GPU and prevents the drivers from being unloaded. Finding the root cause needs more exploration. The lsof command was used to check what was using the driver, but it did not provide any additional information.

Details:

2 components have changed:

  1. New GPU drivers/FW for RHEL 9.2
  2. New kernel for RHEL 9.2

KMM has a feature, available in version 1.1.1, that can unload one in-tree driver.
We can use this feature to unload the in-tree i915, but we cannot unload more than one kmod. We now have a use case for unloading more than one in-tree driver: i915 and intel_vsec for now, and potentially CSE in the future.

3 Main Drivers for GPU: i915, intel_vsec (this is a prerequisite for i915), CSE (MEI)

Out of tree drivers behavior: Loading i915 driver will load the intel_vsec driver. Unloading i915 will unload intel_vsec.
In-tree driver behavior: Loading i915 does not load intel_vsec. Unloading i915 does not unload intel_vsec.
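Since unloading the in-tree i915 leaves intel_vsec loaded, both modules have to be removed explicitly. The following is a minimal sketch of that sequence, not a tested procedure for OCP nodes. The function name `unload_in_tree_gpu_modules` and the `DRY_RUN` variable are hypothetical names introduced for this example; by default the function only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: remove the in-tree GPU modules before loading the out of tree ones.
# DRY_RUN and unload_in_tree_gpu_modules are hypothetical names for this example.

unload_in_tree_gpu_modules() {
    local mod
    # i915 goes first; in-tree i915 does not remove intel_vsec, so it is
    # unloaded explicitly as a second step.
    for mod in i915 intel_vsec; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (default): only print the command.
            echo "modprobe -rv ${mod}"
        else
            modprobe -rv "${mod}" || return 1
        fi
    done
}
```

Running `unload_in_tree_gpu_modules` prints the two `modprobe -rv` commands; setting `DRY_RUN=0` (as root) executes them and stops at the first failure.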

RHEL 9.2 OCP 4.13 has a new kernel based on 5.14.z upstream kernel. This is a huge jump from RHEL 8.6 based OCP 4.12 which used 4.18.z upstream kernel.

Initial smoke test analysis and Observed Impact:

RHEL 9.2 ships in-tree i915 and intel_vsec drivers. They are not loaded by default; the kernel loads them only when it detects the GPU card via its PCI device ID. These two in-tree drivers do not support Intel GPU Flex or Max Series. The in-tree i915 driver provides display support for Intel Client Arc GPUs. As a result, customers will see the following message in dmesg:

sh-5.1# dmesg | grep graphics
[   12.385679] i915 0000:33:00.0: Your graphics device 56c0 is not properly supported by the driver in this

[  478.732896] i915 0000:33:00.0: Your graphics device 56c0 is not properly supported by the driver in this

Intel® Data Center GPU Flex 170 -> PCI ID is 56c0.
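To spot affected nodes quickly, the warning can be reduced to the PCI address and device ID. This is a small sketch that assumes the message format shown above; `unsupported_gpu_ids` is a hypothetical helper name, and it reads dmesg text from stdin so it can also be run against saved logs.

```shell
#!/usr/bin/env bash
# Sketch: print "<pci-address> <device-id>" for each i915
# "not properly supported" warning read from stdin, without duplicates.
# unsupported_gpu_ids is a hypothetical helper name.

unsupported_gpu_ids() {
    sed -n 's/.*i915 \([0-9a-f:.]*\): Your graphics device \([0-9a-f]*\) is not properly supported.*/\1 \2/p' \
        | sort -u
}
```

Usage: `dmesg | unsupported_gpu_ids`. A device ID of 56c0 in the output corresponds to the Flex 170 mentioned above.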

Observation 1:

If the in-tree intel_vsec is not unloaded before loading the out of tree i915 driver, unknown symbol errors are observed in dmesg.

[ 3238.466900] compat: loading out-of-tree module taints kernel.
[ 3238.466931] compat: module verification failed: signature and/or required key missing - tainting kernel
[ 3238.468361] COMPAT BACKPORTED INIT
[ 3238.468362] Loading modules backported from I915-23.6.37
[ 3238.468363] Backport generated by backports.git I915_23.6.37_PSB_230425.49
[ 3239.444973] i915: Unknown symbol intel_vsec_register (err -2)
[ 3271.091366] i915: Unknown symbol intel_vsec_register (err -2)
[ 3317.364301] i915: Unknown symbol intel_vsec_register (err -2)
[ 3376.362727] i915: Unknown symbol intel_vsec_register (err -2)

When we unload the in-tree intel_vsec driver first and change nothing else, the errors above are not observed.
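The repeated "Unknown symbol" lines can be condensed into one line per symbol with a count, which makes it easier to see which exports the loaded intel_vsec is missing. This sketch assumes the dmesg format shown above; `missing_symbols` is a hypothetical helper name. The `err -2` in those lines is `-ENOENT`, meaning the symbol could not be found.

```shell
#!/usr/bin/env bash
# Sketch: summarise "Unknown symbol" errors from dmesg text on stdin as
# "<symbol> <occurrences>", one line per symbol.
# missing_symbols is a hypothetical helper name.

missing_symbols() {
    awk '/Unknown symbol/ {
        # The symbol name is the field right after the word "symbol".
        for (i = 1; i < NF; i++)
            if ($i == "symbol") count[$(i + 1)]++
    }
    END { for (s in count) print s, count[s] }' | sort
}
```

Usage: `dmesg | missing_symbols`.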

Observation 2:

When you delete the KMM Module CR, KMM attempts to unload the out of tree i915 driver via a PreStop hook, but it does not reload the in-tree i915 driver. This is by KMM design. Essentially, the kernel is tainted. In our case the cleanup fails: KMM is unable to unload the out of tree i915 driver because it is reported as in use.

We are also unable to manually unload the out of tree i915 or intel_vsec driver.

sh-5.1# modprobe -rv intel_vsec
modprobe: FATAL: Module intel_vsec is in use.
sh-5.1# modprobe -rv i915      
modprobe: FATAL: Module i915 is in use.
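Because the framebuffer is suspected of holding the driver, one thing worth checking is whether a frame buffer console is bound. The sketch below lists the vtconsole entries whose `name` file mentions "frame buffer". It is an assumption that unbinding such a console would release i915 on these nodes; the root cause is still open. `fb_vtconsoles` is a hypothetical helper name, and the sysfs root is a parameter only so the function can be tried against a test directory.

```shell
#!/usr/bin/env bash
# Sketch: list virtual consoles backed by the frame buffer device.
# fb_vtconsoles is a hypothetical helper name.
# $1: sysfs root (defaults to /sys).

fb_vtconsoles() {
    local root="${1:-/sys}" vt
    for vt in "${root}"/class/vtconsole/vtcon*; do
        # Skip entries without a readable name file (also covers an
        # unmatched glob).
        [ -r "${vt}/name" ] || continue
        if grep -q 'frame buffer' "${vt}/name"; then
            echo "${vt}"
        fi
    done
}

# Possible follow-up, as root and untested in this setup:
#   for vt in $(fb_vtconsoles); do echo 0 > "${vt}/bind"; done
```

If the function prints nothing, no frame buffer console is bound and the remaining references come from somewhere else.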

lsmod output after the out of tree drivers are loaded. Note the use count, which is the third column.

sh-5.1# lsmod | grep i915
i915                 3977216  4
intel_vsec             20480  1 i915
intel_gtt              24576  1 i915
compat                 24576  2 intel_vsec,i915
video                  61440  1 i915
drm_display_helper    172032  2 compat,i915
cec                    61440  2 drm_display_helper,i915
i2c_algo_bit           16384  2 ast,i915
drm_kms_helper        192512  5 ast,drm_display_helper,i915
drm                   581632  7 drm_kms_helper,compat,ast,drm_shmem_helper,drm_display_helper,i915

sh-5.1# lsmod | grep intel_vsec
intel_vsec             20480  1 i915
compat                 24576  2 intel_vsec,i915
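In the output above, i915 has a use count of 4 but no modules listed in the fourth column, so the references are not held by other kernel modules. intel_vsec, by contrast, is held only by i915. A small sketch to pull out just those two fields for a given module follows; `module_usage` is a hypothetical helper name and it reads lsmod output from stdin.

```shell
#!/usr/bin/env bash
# Sketch: print "<use-count> <holders>" for one module from lsmod output
# on stdin. A "-" means no other module is listed as a holder.
# module_usage is a hypothetical helper name.

module_usage() {
    awk -v mod="$1" '$1 == mod { print $3, ($4 == "" ? "-" : $4) }'
}
```

Usage: `lsmod | module_usage i915`. The same information is also exposed under `/sys/module/i915/refcnt` and `/sys/module/i915/holders/`.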

As a future exercise, a dependency diagram for the out of tree GPU drivers should be documented.

@hershpa
Contributor Author

hershpa commented Nov 2, 2023

FYI @qbarrand, we now have a use case for unloading more than one in-tree driver.
