mig config failed #157

Open
MLintin opened this issue Dec 31, 2024 · 1 comment

MLintin commented Dec 31, 2024

I have two A30 GPUs.
With the config below, applying the MIG configuration succeeds:

version: v1
mig-configs:
  all-disabled:
  - devices: all
    mig-enabled: false
  node1:
  - devices:
    - 0
    mig-enabled: true
    mig-devices:
      1g.6gb: 4

With the config below, it also succeeds:

version: v1
mig-configs:
  all-disabled:
  - devices: all
    mig-enabled: false
  node1:
  - devices:
    - 0
    - 1
    mig-enabled: true
    mig-devices:
      1g.6gb: 4

But with the config below, which targets only GPU 1, it fails:

version: v1
mig-configs:
  all-disabled:
  - devices: all
    mig-enabled: false
  node1:
  - devices:
    - 1
    mig-enabled: true
    mig-devices:
      1g.6gb: 4

logs:
Applying the MIG mode change from the selected config to the node (and double checking it took effect)
If the -r option was passed, the node will be automatically rebooted if this is not successful
time="2024-12-31T06:44:06Z" level=debug msg="Parsing config file..."
time="2024-12-31T06:44:06Z" level=debug msg="Selecting specific MIG config..."
time="2024-12-31T06:44:06Z" level=debug msg="Running apply-start hook"
time="2024-12-31T06:44:06Z" level=debug msg="Checking current MIG mode..."
time="2024-12-31T06:44:08Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2024-12-31T06:44:08Z" level=debug msg=" GPU 1: 0x20B710DE"
time="2024-12-31T06:44:08Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2024-12-31T06:44:08Z" level=debug msg=" MIG capable: true\n"
time="2024-12-31T06:44:08Z" level=debug msg=" Current MIG mode: Disabled"
time="2024-12-31T06:44:10Z" level=debug msg="Running pre-apply-mode hook"
time="2024-12-31T06:44:10Z" level=debug msg="Applying MIG mode change..."
time="2024-12-31T06:44:13Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2024-12-31T06:44:13Z" level=debug msg=" GPU 1: 0x20B710DE"
time="2024-12-31T06:44:13Z" level=debug msg=" MIG capable: true\n"
time="2024-12-31T06:44:13Z" level=debug msg=" Current MIG mode: Disabled"
time="2024-12-31T06:44:13Z" level=debug msg=" Updating MIG mode: Enabled"
time="2024-12-31T06:44:17Z" level=debug msg=" Mode change pending: false"
time="2024-12-31T06:44:19Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
time="2024-12-31T06:44:19Z" level=debug msg="Parsing config file..."
time="2024-12-31T06:44:19Z" level=debug msg="Selecting specific MIG config..."
time="2024-12-31T06:44:19Z" level=debug msg="Asserting MIG mode configuration..."
time="2024-12-31T06:44:22Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2024-12-31T06:44:22Z" level=debug msg=" GPU 1: 0x20B710DE"
time="2024-12-31T06:44:22Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2024-12-31T06:44:22Z" level=debug msg=" MIG capable: true\n"
time="2024-12-31T06:44:22Z" level=debug msg=" Current MIG mode: Enabled"
Selected MIG mode settings from configuration currently applied
Applying the selected MIG config to the node
time="2024-12-31T06:44:23Z" level=debug msg="Parsing config file..."
time="2024-12-31T06:44:23Z" level=debug msg="Selecting specific MIG config..."
time="2024-12-31T06:44:23Z" level=debug msg="Running apply-start hook"
time="2024-12-31T06:44:23Z" level=debug msg="Checking current MIG mode..."
time="2024-12-31T06:44:26Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2024-12-31T06:44:26Z" level=debug msg=" GPU 1: 0x20B710DE"
time="2024-12-31T06:44:26Z" level=debug msg=" Asserting MIG mode: Enabled"
time="2024-12-31T06:44:26Z" level=debug msg=" MIG capable: true\n"
time="2024-12-31T06:44:26Z" level=debug msg=" Current MIG mode: Enabled"
time="2024-12-31T06:44:28Z" level=debug msg="Checking current MIG device configuration..."
time="2024-12-31T06:44:30Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2024-12-31T06:44:30Z" level=debug msg=" GPU 1: 0x20B710DE"
time="2024-12-31T06:44:30Z" level=debug msg=" Asserting MIG config: map[1g.6gb:4]"
time="2024-12-31T06:44:32Z" level=debug msg="Running pre-apply-config hook"
time="2024-12-31T06:44:32Z" level=debug msg="Applying MIG device configuration..."
time="2024-12-31T06:44:35Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2024-12-31T06:44:35Z" level=debug msg=" GPU 1: 0x20B710DE"
time="2024-12-31T06:44:35Z" level=debug msg=" MIG capable: true\n"
time="2024-12-31T06:44:35Z" level=debug msg=" Updating MIG config: map[1g.6gb:4]"
time="2024-12-31T06:44:35Z" level=error msg="Error getting GPU instance profile info for '1g.6gb': ERROR_NOT_SUPPORTED"
time="2024-12-31T06:44:37Z" level=debug msg="Running apply-exit hook"
time="2024-12-31T06:44:37Z" level=fatal msg="Error applying MIG configuration with hooks: error setting MIGConfig: error attempting multiple config orderings: all orderings failed"
Restarting any GPU clients previously shutdown on the host by restarting their systemd services
Starting kubelet.service

MLintin commented Jan 7, 2025

Additional information:
Figure 1: [image not reproduced]

Code version:
0.9.1

Reason:
pkg/mig/config/config.go:305:
return iterate(config.Flatten(), f, 0)

When the MIG config is matched against the GPUs to obtain the nvdev.MigProfileInfo, GPUs that are not listed in the config (GPU 0 here) are not filtered out.
As shown in Figure 1, with my config the lookup matches GPU instance profile 15 on GPU 0 rather than GPU instance profile 14 on GPU 1, because both have the same name: 1g.6gb.
So when pkg/mig/config/config.go:153 (giProfileInfo, ret := device.GetGpuInstanceProfileInfo(mp.GIProfileID)) is called to get the GpuInstanceProfileInfo, the wrong mp.GIProfileID is used, which leads to the failure "Error getting GPU instance profile info for '1g.6gb': ERROR_NOT_SUPPORTED".
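
To make the suspected mismatch concrete, here is a minimal Go sketch. It is not the mig-parted code: the `gpu` type and the functions `resolveFirstMatch` and `resolveForDevice` are hypothetical names made up for illustration, and the profile IDs 15 and 14 are simply the values reported above for Figure 1. It contrasts a name lookup that scans all GPUs with one that is restricted to the GPU being configured.

```go
package main

import "fmt"

// gpu is a hypothetical stand-in for a physical device.
// It is NOT a mig-parted or go-nvlib type.
type gpu struct {
	index    int
	profiles map[string]int // profile name -> GI profile ID valid on this GPU
}

// resolveFirstMatch models the suspected behaviour: it scans every GPU in
// order and returns the first ID found for the name, ignoring which GPU
// the config actually targets.
func resolveFirstMatch(gpus []gpu, name string) (int, bool) {
	for _, g := range gpus {
		if id, ok := g.profiles[name]; ok {
			return id, true
		}
	}
	return 0, false
}

// resolveForDevice restricts the lookup to the targeted GPU, so two GPUs
// that expose the same profile name under different IDs cannot be mixed up.
func resolveForDevice(gpus []gpu, target int, name string) (int, bool) {
	for _, g := range gpus {
		if g.index != target {
			continue
		}
		id, ok := g.profiles[name]
		return id, ok
	}
	return 0, false
}

func main() {
	// IDs as reported in the issue: 15 on GPU 0, 14 on GPU 1.
	gpus := []gpu{
		{index: 0, profiles: map[string]int{"1g.6gb": 15}},
		{index: 1, profiles: map[string]int{"1g.6gb": 14}},
	}
	wrong, _ := resolveFirstMatch(gpus, "1g.6gb")
	right, _ := resolveForDevice(gpus, 1, "1g.6gb")
	fmt.Println("first match:", wrong, "per-device:", right)
}
```

With these inputs the first-match lookup yields 15 while the per-device lookup yields 14, which is the discrepancy I believe causes ERROR_NOT_SUPPORTED when only GPU 1 is listed.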

My question:
Is this a bug, or is my usage wrong?

If it is a bug, could you outline how it should be fixed?
If my usage is wrong, could you tell me the correct way to configure this?

Thank you very much!
Looking forward to your reply.
