When submitting a task based on the README example, the pod stays Pending whenever nvidia.com/gpu is greater than 1 #620

Open
peisp opened this issue Nov 15, 2024 · 1 comment
Labels
kind/bug Something isn't working

peisp commented Nov 15, 2024

What happened:
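When I submit a task based on the README example, the pod stays in the Pending state whenever the nvidia.com/gpu value is greater than 1.
When nvidia.com/gpu is 1, the pod is scheduled normally.
What you expected to happen:
With nvidia.com/gpu greater than 1, the pod should be scheduled normally.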
How to reproduce it (as minimally and precisely as possible):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 2 # request 2 vGPUs
          nvidia.com/gpumem: 3000 # each vGPU requests 3000 MB of device memory (optional, integer)
          nvidia.com/gpucores: 30 # each vGPU gets 30% of the physical GPU's compute (optional, integer)

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
==============NVSMI LOG==============

Timestamp                                 : Fri Nov 15 09:51:27 2024
Driver Version                            : 535.161.07
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:23:00.0
    Product Name                          : NVIDIA H100 PCIe
    Product Brand                         : NVIDIA
    Product Architecture                  : Hopper
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1650623011559
    GPU UUID                              : GPU-eb066683-1619-6849-17af-ef32671085ea
    Minor Number                          : 0
    VBIOS Version                         : 96.00.30.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x2300
    Board Part Number                     : 900-21010-0000-000
    GPU Part Number                       : 2331-882-A1
    FRU Part Number                       : N/A
    Module ID                             : 4
    Inforom Version
        Image Version                     : 1010.0200.00.02
        OEM Object                        : 2.1
        ECC Object                        : 7.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 535.161.07
    GPU Virtualization Mode
        Virtualization Mode               : Pass-Through
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : No
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x23
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x233110DE
        Bus Id                            : 00000000:23:00.0
        Sub System Id                     : 0x162610DE
        GPU Link Info
            PCIe Generation
                Max                       : 5
                Current                   : 5
                Device Current            : 5
                Device Max                : 5
                Host Max                  : N/A
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 550 KB/s
        Rx Throughput                     : 496 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : Disabled
    FB Memory Usage
        Total                             : 81559 MiB
        Reserved                          : 551 MiB
        Used                              : 54668 MiB
        Free                              : 26339 MiB
    BAR1 Memory Usage
        Total                             : 131072 MiB
        Used                              : 8 MiB
        Free                              : 131064 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
            SRAM Threshold Exceeded       : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : 0
            SRAM SM                       : 0
            SRAM Microcontroller          : 0
            SRAM PCIE                     : 0
            SRAM Other                    : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 1280 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 32 C
        GPU T.Limit Temp                  : 58 C
        GPU Shutdown T.Limit Temp         : -5 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 47 C
        Memory Max Operating T.Limit Temp : 0 C
    GPU Power Readings
        Power Draw                        : 73.14 W
        Current Power Limit               : 350.00 W
        Requested Power Limit             : 350.00 W
        Default Power Limit               : 310.00 W
        Min Power Limit                   : 200.00 W
        Max Power Limit                   : 350.00 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 1755 MHz
        SM                                : 1755 MHz
        Memory                            : 1593 MHz
        Video                             : 1440 MHz
    Applications Clocks
        Graphics                          : 1755 MHz
        Memory                            : 1593 MHz
    Default Applications Clocks
        Graphics                          : 1755 MHz
        Memory                            : 1593 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1755 MHz
        SM                                : 1755 MHz
        Memory                            : 1593 MHz
        Video                             : 1470 MHz
    Max Customer Boost Clocks
        Graphics                          : 1755 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 845.000 mV
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 424082
            Type                          : C
            Name                          : /home/klroot/anaconda3/envs/xinference-new/bin/python
            Used GPU Memory               : 1934 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 424562
            Type                          : C
            Name                          : /home/klroot/anaconda3/envs/xinference-new/bin/python
            Used GPU Memory               : 1742 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 424914
            Type                          : C
            Name                          : /home/klroot/anaconda3/envs/xinference-new/bin/python
            Used GPU Memory               : 50958 MiB

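The nvidia-smi output above reports a single attached GPU on this host, while the manifest requests two vGPUs. As a rough way to compare the two numbers, here is a minimal Python sketch. It assumes that each requested vGPU has to be placed on a different physical GPU of the same node, which is my reading of the HAMi README and is not confirmed anywhere in this thread. The function names `attached_gpus` and `fits_on_node` are hypothetical helpers, and the sample text is trimmed from the log above. It only looks at one host, so it says nothing about the other nodes in the cluster.

```python
import re


def attached_gpus(nvsmi_log: str) -> int:
    """Return the 'Attached GPUs' count from `nvidia-smi -a` text output."""
    match = re.search(r"^Attached GPUs\s*:\s*(\d+)\s*$", nvsmi_log, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Attached GPUs' line found")
    return int(match.group(1))


def fits_on_node(requested_vgpus: int, nvsmi_log: str) -> bool:
    # Assumption (not confirmed in this issue): every requested vGPU must
    # land on a distinct physical GPU of the same node.
    return requested_vgpus <= attached_gpus(nvsmi_log)


# Header lines copied from the nvidia-smi output in this report.
SAMPLE = """\
Driver Version                            : 535.161.07
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:23:00.0
"""

if __name__ == "__main__":
    for requested in (1, 2):
        print(requested, fits_on_node(requested, SAMPLE))
```

Under that assumption, a request of 1 fits this host and a request of 2 does not, which would match the behavior described in the report.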

  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
I1115 09:47:02.988487       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.PersistentVolume total 7 items received
E1115 09:47:36.687564       1 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"gpu-pod.18081a82a942263c", GenerateName:"", Namespace:"default", SelfLink:"", UID:"fd789efa-b897-4e90-be84-50d41584de2c", ResourceVersion:"183374808", Generation:0, CreationTimestamp:time.Date(2024, time.November, 15, 9, 33, 25, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"kube-scheduler", Operation:"Update", APIVersion:"events.k8s.io/v1", Time:time.Date(2024, time.November, 15, 9, 33, 25, 0, time.Local), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc000b40600), Subresource:""}}}, EventTime:time.Date(2024, time.November, 15, 9, 33, 25, 211706000, time.Local), Series:(*v1.EventSeries)(0xc0017718c0), ReportingController:"hami-scheduler", ReportingInstance:"hami-scheduler-hami-scheduler-7695f8c86-f8s69", Action:"Scheduling", Reason:"FailedScheduling", Regarding:v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"gpu-pod", UID:"dcb6508a-fc29-45a1-8cd0-79c0dea9463c", APIVersion:"v1", ResourceVersion:"183372146", FieldPath:""}, Related:(*v1.ObjectReference)(nil), Note:"0/4 nodes are available: 1 node(s) had untolerated taint {key: value}, 3 node unregistered. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod..", Type:"Warning", DeprecatedSource:v1.EventSource{Component:"", Host:""}, DeprecatedFirstTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeprecatedLastTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeprecatedCount:0}': 'Event "gpu-pod.18081a82a942263c" is invalid: series.count: Invalid value: "": should be at least 2' (will not retry!)
I1115 09:47:37.566028       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/net-istio-webhook-fb977d5d4-dd75c"
I1115 09:47:50.564260       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/net-istio-webhook-fb977d5d4-dd75c"
I1115 09:47:55.668991       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/domainmapping-webhook-7dd49d7948-9cgvt"
I1115 09:47:58.667695       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/activator-85fd9fddb7-hmmrr"
I1115 09:47:59.562275       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/net-istio-controller-c659dc8bd-9dqfn"
I1115 09:48:02.998668       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.CSIStorageCapacity total 10 items received
I1115 09:48:07.564758       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/domain-mapping-77d5f7867d-sbnrx"
I1115 09:48:07.669283       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/domainmapping-webhook-7dd49d7948-9cgvt"
I1115 09:48:10.563628       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/net-istio-controller-c659dc8bd-9dqfn"
I1115 09:48:10.668266       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/activator-85fd9fddb7-hmmrr"
I1115 09:48:13.564968       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/autoscaler-7df77c9857-jb4qd"
I1115 09:48:15.669268       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/webhook-8779d4f95-q9vp7"
I1115 09:48:19.564622       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/domain-mapping-77d5f7867d-sbnrx"
I1115 09:48:21.669417       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/controller-79d7cc489f-bpsdz"
I1115 09:48:25.564416       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/autoscaler-7df77c9857-jb4qd"
I1115 09:48:29.669623       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/webhook-8779d4f95-q9vp7"
I1115 09:48:33.679134       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/controller-79d7cc489f-bpsdz"
I1115 09:48:37.068889       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.Node total 48 items received
I1115 09:48:54.812032       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:48:55.856551       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:49:10.046900       1 reflector.go:559] pkg/authentication/request/headerrequest/requestheader_controller.go:172: Watch close - *v1.ConfigMap total 9 items received
I1115 09:49:19.049217       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.ReplicaSet total 36 items received
I1115 09:49:29.074580       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.PersistentVolumeClaim total 7 items received
I1115 09:49:34.102908       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.StatefulSet total 10 items received
I1115 09:49:51.070903       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.CSIDriver total 10 items received
I1115 09:50:00.041753       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.CSINode total 9 items received
I1115 09:50:06.042910       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.Namespace total 11 items received
I1115 09:50:42.698713       1 eventhandlers.go:118] "Add event for unscheduled pod" pod="default/gpu-pod"
I1115 09:50:42.698814       1 scheduling_queue.go:1066] "About to try and schedule pod" pod="default/gpu-pod"
I1115 09:50:42.698829       1 schedule_one.go:81] "Attempting to schedule pod" pod="default/gpu-pod"
I1115 09:50:42.702282       1 schedule_one.go:854] "Unable to schedule pod; no fit; waiting" pod="default/gpu-pod" err="0/4 nodes are available: 3 node unregistered. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.."
I1115 09:50:42.702363       1 schedule_one.go:930] "Updating pod condition" pod="default/gpu-pod" conditionType=PodScheduled conditionStatus=False conditionReason="Unschedulable"
I1115 09:50:47.667397       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-eventing/eventing-webhook-645cc4cfd5-fc8rx"
I1115 09:50:55.664267       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-eventing/eventing-controller-bd7666b7-xfvvg"
I1115 09:50:58.670268       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-eventing/eventing-webhook-645cc4cfd5-fc8rx"
I1115 09:51:09.665640       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-eventing/eventing-controller-bd7666b7-xfvvg"
I1115 09:51:17.201626       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:51:21.047233       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.StorageClass total 11 items received
I1115 09:51:31.668065       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:51:32.591684       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:51:35.024203       1 reflector.go:559] pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 11 items received
I1115 09:51:39.019926       1 reflector.go:559] pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 9 items received
I1115 09:51:59.059421       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.Pod total 85 items received
I1115 09:52:28.020843       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.PodDisruptionBudget total 7 items received
I1115 09:52:39.562821       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/net-istio-webhook-fb977d5d4-dd75c"
I1115 09:52:48.990499       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.PersistentVolume total 6 items received
I1115 09:52:50.569504       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/net-istio-webhook-fb977d5d4-dd75c"
I1115 09:52:51.071117       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.ReplicationController total 8 items received
I1115 09:53:06.670966       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/domainmapping-webhook-7dd49d7948-9cgvt"
I1115 09:53:07.564147       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/net-istio-controller-c659dc8bd-9dqfn"
I1115 09:53:10.668667       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/activator-85fd9fddb7-hmmrr"
I1115 09:53:11.565236       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/domain-mapping-77d5f7867d-sbnrx"
I1115 09:53:12.669254       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/webhook-8779d4f95-q9vp7"
I1115 09:53:17.563255       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/autoscaler-7df77c9857-jb4qd"
I1115 09:53:21.674002       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/domainmapping-webhook-7dd49d7948-9cgvt"
I1115 09:53:22.564365       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/net-istio-controller-c659dc8bd-9dqfn"
I1115 09:53:22.669255       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/activator-85fd9fddb7-hmmrr"
I1115 09:53:22.682750       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/controller-79d7cc489f-bpsdz"
I1115 09:53:24.566394       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/domain-mapping-77d5f7867d-sbnrx"
I1115 09:53:26.669414       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/webhook-8779d4f95-q9vp7"
I1115 09:53:31.565206       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/autoscaler-7df77c9857-jb4qd"
I1115 09:53:33.001704       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.Service total 9 items received
I1115 09:53:33.667382       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/controller-79d7cc489f-bpsdz"
I1115 09:53:54.011106       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:54:04.659226       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:54:20.703736       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:54:41.000927       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.CSIStorageCapacity total 7 items received
I1115 09:55:36.043453       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.CSINode total 7 items received
I1115 09:55:41.079195       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.CSIDriver total 7 items received
I1115 09:55:53.665287       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-eventing/eventing-webhook-645cc4cfd5-fc8rx"
I1115 09:55:55.235602       1 scheduling_queue.go:1066] "About to try and schedule pod" pod="default/gpu-pod"
I1115 09:55:55.235641       1 schedule_one.go:81] "Attempting to schedule pod" pod="default/gpu-pod"
I1115 09:55:55.240368       1 schedule_one.go:854] "Unable to schedule pod; no fit; waiting" pod="default/gpu-pod" err="0/4 nodes are available: 3 node unregistered. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.."
I1115 09:55:55.240463       1 schedule_one.go:930] "Updating pod condition" pod="default/gpu-pod" conditionType=PodScheduled conditionStatus=False conditionReason="Unschedulable"
I1115 09:55:56.072441       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.Node total 45 items received
I1115 09:56:04.666849       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-eventing/eventing-webhook-645cc4cfd5-fc8rx"
I1115 09:56:08.665952       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-eventing/eventing-controller-bd7666b7-xfvvg"
I1115 09:56:22.665933       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-eventing/eventing-controller-bd7666b7-xfvvg"
I1115 09:56:43.123300       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:56:49.281652       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:56:49.296005       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:56:49.463331       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:56:50.338671       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:56:50.342514       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:56:50.344667       1 eventhandlers.go:231] "Delete event for scheduled pod" pod="kubeflow/kserve-controller-manager-5877985898-sz4km"
I1115 09:56:50.344728       1 scheduling_queue.go:1066] "About to try and schedule pod" pod="default/gpu-pod"
I1115 09:56:50.344738       1 schedule_one.go:81] "Attempting to schedule pod" pod="default/gpu-pod"
I1115 09:56:50.346186       1 schedule_one.go:854] "Unable to schedule pod; no fit; waiting" pod="default/gpu-pod" err="0/4 nodes are available: 3 node unregistered. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.."
I1115 09:56:50.346250       1 schedule_one.go:930] "Updating pod condition" pod="default/gpu-pod" conditionType=PodScheduled conditionStatus=False conditionReason="Unschedulable"
E1115 09:56:50.348692       1 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"gpu-pod.18081bbcfd3db7f6", GenerateName:"", Namespace:"default", SelfLink:"", UID:"fbbba899-0f73-4a91-b22c-1de0376259cf", ResourceVersion:"183384634", Generation:0, CreationTimestamp:time.Date(2024, time.November, 15, 9, 55, 55, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"kube-scheduler", Operation:"Update", APIVersion:"events.k8s.io/v1", Time:time.Date(2024, time.November, 15, 9, 55, 55, 0, time.Local), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc00157c1e0), Subresource:""}}}, EventTime:time.Date(2024, time.November, 15, 9, 55, 55, 240432000, time.Local), Series:(*v1.EventSeries)(0xc00192cd00), ReportingController:"hami-scheduler", ReportingInstance:"hami-scheduler-hami-scheduler-7695f8c86-f8s69", Action:"Scheduling", Reason:"FailedScheduling", Regarding:v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"gpu-pod", UID:"c6c8a25d-5810-4e22-beb1-061292c5fa3c", APIVersion:"v1", ResourceVersion:"183382502", FieldPath:""}, Related:(*v1.ObjectReference)(nil), Note:"0/4 nodes are available: 3 node unregistered. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod..", Type:"Warning", DeprecatedSource:v1.EventSource{Component:"", Host:""}, DeprecatedFirstTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeprecatedLastTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeprecatedCount:0}': 'Event "gpu-pod.18081bbcfd3db7f6" is invalid: series.count: Invalid value: "": should be at least 2' (will not retry!)
I1115 09:56:58.044503       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.Namespace total 8 items received
I1115 09:56:58.333582       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-models-web-app-58df76d4b7-q7lxd"
I1115 09:57:03.597400       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-models-web-app-58df76d4b7-q7lxd"
I1115 09:57:04.830244       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-models-web-app-58df76d4b7-q7lxd"
I1115 09:57:04.836720       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/kserve-models-web-app-58df76d4b7-q7lxd"
I1115 09:57:04.840955       1 eventhandlers.go:231] "Delete event for scheduled pod" pod="kubeflow/kserve-models-web-app-58df76d4b7-q7lxd"
I1115 09:57:04.841043       1 scheduling_queue.go:1066] "About to try and schedule pod" pod="default/gpu-pod"
I1115 09:57:04.841059       1 schedule_one.go:81] "Attempting to schedule pod" pod="default/gpu-pod"
I1115 09:57:04.843010       1 schedule_one.go:854] "Unable to schedule pod; no fit; waiting" pod="default/gpu-pod" err="0/4 nodes are available: 3 node unregistered. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.."
I1115 09:57:04.843082       1 schedule_one.go:930] "Updating pod condition" pod="default/gpu-pod" conditionType=PodScheduled conditionStatus=False conditionReason="Unschedulable"
I1115 09:57:05.254827       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/centraldashboard-7c68945c67-xvv48"
I1115 09:57:10.516487       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/centraldashboard-7c68945c67-xvv48"
I1115 09:57:11.000103       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/centraldashboard-7c68945c67-xvv48"
I1115 09:57:11.006326       1 eventhandlers.go:206] "Update event for scheduled pod" pod="kubeflow/centraldashboard-7c68945c67-xvv48"
I1115 09:57:11.008425       1 eventhandlers.go:231] "Delete event for scheduled pod" pod="kubeflow/centraldashboard-7c68945c67-xvv48"
I1115 09:57:13.369480       1 scheduling_queue.go:1066] "About to try and schedule pod" pod="default/gpu-pod"
I1115 09:57:13.369511       1 schedule_one.go:81] "Attempting to schedule pod" pod="default/gpu-pod"
I1115 09:57:13.371608       1 schedule_one.go:854] "Unable to schedule pod; no fit; waiting" pod="default/gpu-pod" err="0/4 nodes are available: 3 node unregistered. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.."
I1115 09:57:13.371718       1 schedule_one.go:930] "Updating pod condition" pod="default/gpu-pod" conditionType=PodScheduled conditionStatus=False conditionReason="Unschedulable"
I1115 09:57:22.051388       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.ReplicaSet total 16 items received
I1115 09:57:22.061672       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.Pod total 47 items received
I1115 09:57:37.569820       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/net-istio-webhook-fb977d5d4-dd75c"
I1115 09:57:52.563586       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/net-istio-webhook-fb977d5d4-dd75c"
I1115 09:57:53.022445       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.PodDisruptionBudget total 6 items received
I1115 09:58:11.668296       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/domainmapping-webhook-7dd49d7948-9cgvt"
I1115 09:58:12.667001       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/activator-85fd9fddb7-hmmrr"
I1115 09:58:16.563274       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/net-istio-controller-c659dc8bd-9dqfn"
I1115 09:58:21.701798       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/webhook-8779d4f95-q9vp7"
I1115 09:58:23.563945       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/autoscaler-7df77c9857-jb4qd"
I1115 09:58:23.667376       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/domainmapping-webhook-7dd49d7948-9cgvt"
I1115 09:58:23.679107       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/controller-79d7cc489f-bpsdz"
I1115 09:58:24.049630       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.StorageClass total 7 items received
I1115 09:58:25.667507       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/activator-85fd9fddb7-hmmrr"
I1115 09:58:27.104491       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.StatefulSet total 9 items received
I1115 09:58:28.564707       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/net-istio-controller-c659dc8bd-9dqfn"
I1115 09:58:29.076123       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.PersistentVolumeClaim total 10 items received
I1115 09:58:29.563449       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/domain-mapping-77d5f7867d-sbnrx"
I1115 09:58:35.686549       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/webhook-8779d4f95-q9vp7"
I1115 09:58:38.564126       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/autoscaler-7df77c9857-jb4qd"
I1115 09:58:39.666450       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/controller-79d7cc489f-bpsdz"
I1115 09:58:42.565028       1 eventhandlers.go:206] "Update event for scheduled pod" pod="knative-serving/domain-mapping-77d5f7867d-sbnrx"
I1115 09:58:46.049519       1 reflector.go:559] pkg/authentication/request/headerrequest/requestheader_controller.go:172: Watch close - *v1.ConfigMap total 11 items received
I1115 09:59:19.026688       1 reflector.go:559] pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 8 items received
I1115 09:59:26.072989       1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.ReplicationController total 7 items received
I1115 09:59:33.023876       1 reflector.go:559] pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Watch close - *v1.ConfigMap total 9 items received
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version: 2.4.0
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:
@peisp added the kind/bug (Something isn't working) label on Nov 15, 2024
@lixd

lixd commented Nov 15, 2024

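The number of vGPUs requested through nvidia.com/gpu cannot exceed the number of physical GPUs on the node. Your nvidia-smi output reports Attached GPUs: 1, so a request for 2 vGPUs cannot be satisfied and the pod stays Pending.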
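
As a rough sketch of what this means for the manifest in this issue: assuming the target node has a single physical GPU (the nvidia-smi output in the report shows one attached H100), the request would be lowered to one vGPU, which the report already says schedules normally. All other fields are copied unchanged from the reproduction; only the nvidia.com/gpu value differs.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          # Assumption: the node has one physical GPU, so the vGPU count must not exceed 1
          nvidia.com/gpu: 1
          # 3000m of device memory per vGPU (optional, integer), as in the original example
          nvidia.com/gpumem: 3000
          # Each vGPU gets 30% of the physical GPU's compute (optional, integer)
          nvidia.com/gpucores: 30
```

To get more than one vGPU from the same card, the usual approach would be to run several pods (or containers) that each request nvidia.com/gpu: 1 with their own gpumem and gpucores limits, rather than raising the count in a single request.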
