Driver error when killing dp_service #222

Open
PlagueCZ opened this issue Mar 1, 2023 · 0 comments
Labels
bug Something isn't working


PlagueCZ commented Mar 1, 2023

Almost every time the k8s setup of @FlorinPeter kills dp-service (either because of an init timeout or just when replacing the pod), a kernel warning is produced.

[Mon Feb 27 21:49:45 2023] ------------[ cut here ]------------
[Mon Feb 27 21:49:45 2023] WARNING: CPU: 0 PID: 8667 at drivers/iommu/dma-iommu.c:1038 iommu_dma_unmap_page+0x79/0x90
[Mon Feb 27 21:49:45 2023] Modules linked in: vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_type1 vfio cls_bpf sch_ingress xt_TPROXY nf_tproxy_ipv6 nf_tproxy_ipv4 ip_set xt_CT algif_hash af_alg veth xfrm_user xfrm_algo xt_socket nf_socket_ipv4 nf_socket_ipv6 ip6table_raw ip6table_mangle iptable_raw iptable_mangle xt_MASQUERADE nft_chain_nat xt_mark xt_conntrack xt_comment nft_compat nf_tables nfnetlink ip6table_filter ip6table_nat ip6_tables iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c coretemp intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp kvm_intel nvme_fabrics kvm irqbypass rapl iTCO_wdt mgag200 intel_pmc_bxt drm_shmem_helper iTCO_vendor_support mlx5_ib drm_kms_helper watchdog intel_cstate ipmi_ssif sunrpc binfmt_misc ib_uverbs mei_me ib_core i2c_algo_bit isst_if_mmio ioatdma ipmi_si i2c_i801 isst_if_mbox_pci intel_uncore pcspkr mei i2c_smbus evdev
[Mon Feb 27 21:49:45 2023]  isst_if_common joydev intel_pch_thermal intel_vsec dca ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad button br_netfilter bridge stp llc fuse drm efi_pstore dm_mod configfs ip_tables x_tables overlay squashfs loop ext4 crc16 mbcache jbd2 crc32c_generic hid_generic usbhid hid crc32_pclmul crc32c_intel nvme nvme_core ghash_clmulni_intel hwmon sha512_ssse3 mlx5_core t10_pi ahci xhci_pci libahci crc64_rocksoft_generic xhci_hcd libata crc64_rocksoft mlxfw crc_t10dif aesni_intel psample crypto_simd ptp crct10dif_generic crct10dif_pclmul usbcore scsi_mod cryptd pps_core crc64 wmi pci_hyperv_intf usb_common crct10dif_common scsi_common cfg80211 rfkill efivarfs autofs4
[Mon Feb 27 21:49:45 2023] CPU: 0 PID: 8667 Comm: dp_service Not tainted 6.1.13-gardenlinux-amd64 #1  Debian 6.1.13-0gardenlinux1
[Mon Feb 27 21:49:45 2023] Hardware name: Lenovo ThinkSystem SR630 V2/7Z71CTO1WW, BIOS AFE118I-1.30 04/21/2022
[Mon Feb 27 21:49:45 2023] RIP: 0010:iommu_dma_unmap_page+0x79/0x90
[Mon Feb 27 21:49:45 2023] Code: 2b 48 3b 28 72 26 48 3b 68 08 73 20 4d 89 f8 44 89 f1 4c 89 ea 48 89 ee 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 27 b6 a6 ff <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 66 0f 1f 44 00
[Mon Feb 27 21:49:45 2023] RSP: 0018:ff6e22db8c83fb18 EFLAGS: 00010246
[Mon Feb 27 21:49:45 2023] RAX: 0000000000000000 RBX: ff2c9e75106e30d0 RCX: 0000000000000009
[Mon Feb 27 21:49:45 2023] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[Mon Feb 27 21:49:45 2023] RBP: ff2c9df5ced4a000 R08: 0000000000000002 R09: ff6e22db8c83fb48
[Mon Feb 27 21:49:45 2023] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[Mon Feb 27 21:49:45 2023] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[Mon Feb 27 21:49:45 2023] FS:  0000000000000000(0000) GS:ff2c9e72ff600000(0000) knlGS:0000000000000000
[Mon Feb 27 21:49:45 2023] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Feb 27 21:49:45 2023] CR2: 00007fddc11a01d0 CR3: 00000056f5a10003 CR4: 0000000000773ef0
[Mon Feb 27 21:49:45 2023] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Mon Feb 27 21:49:45 2023] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Mon Feb 27 21:49:45 2023] PKRU: 55555554
[Mon Feb 27 21:49:45 2023] Call Trace:
[Mon Feb 27 21:49:45 2023]  <TASK>
[Mon Feb 27 21:49:45 2023]  mlx5_free_priv_descs+0x5b/0x80 [mlx5_ib]
[Mon Feb 27 21:49:45 2023]  mlx5_ib_dereg_mr+0x324/0x3b0 [mlx5_ib]
[Mon Feb 27 21:49:45 2023]  ? _raw_spin_lock_irqsave+0x23/0x50
[Mon Feb 27 21:49:45 2023]  ib_dereg_mr_user+0x3c/0xc0 [ib_core]
[Mon Feb 27 21:49:45 2023]  destroy_hw_idr_uobject+0x1c/0x50 [ib_uverbs]
[Mon Feb 27 21:49:45 2023]  uverbs_destroy_uobject+0x34/0x1e0 [ib_uverbs]
[Mon Feb 27 21:49:45 2023]  __uverbs_cleanup_ufile+0xbd/0x130 [ib_uverbs]
[Mon Feb 27 21:49:45 2023]  uverbs_destroy_ufile_hw+0x38/0xf0 [ib_uverbs]
[Mon Feb 27 21:49:45 2023]  ib_uverbs_close+0x1f/0xb0 [ib_uverbs]
[Mon Feb 27 21:49:45 2023]  __fput+0x8e/0x250
[Mon Feb 27 21:49:45 2023]  task_work_run+0x56/0x90
[Mon Feb 27 21:49:45 2023]  do_exit+0x348/0xb00
[Mon Feb 27 21:49:45 2023]  ? fq_ring_free+0x55/0xa0
[Mon Feb 27 21:49:45 2023]  do_group_exit+0x2d/0x80
[Mon Feb 27 21:49:45 2023]  get_signal+0x96a/0x970
[Mon Feb 27 21:49:45 2023]  ? __run_timers+0x144/0x2a0
[Mon Feb 27 21:49:45 2023]  arch_do_signal_or_restart+0x3e/0x840
[Mon Feb 27 21:49:45 2023]  ? _raw_spin_unlock_irqrestore+0x23/0x40
[Mon Feb 27 21:49:45 2023]  ? rcu_core+0x1ff/0x4e0
[Mon Feb 27 21:49:45 2023]  exit_to_user_mode_prepare+0x114/0x1c0
[Mon Feb 27 21:49:45 2023]  irqentry_exit_to_user_mode+0x5/0x30
[Mon Feb 27 21:49:45 2023]  asm_sysvec_reschedule_ipi+0x16/0x20
[Mon Feb 27 21:49:45 2023] RIP: 0033:0x5647d75dd5d0
[Mon Feb 27 21:49:45 2023] Code: Unable to access opcode bytes at 0x5647d75dd5a6.
[Mon Feb 27 21:49:45 2023] RSP: 002b:00007fff7a9354c0 EFLAGS: 00000202
[Mon Feb 27 21:49:45 2023] RAX: 0000207f7bf8e7d5 RBX: 0000000000000000 RCX: 71e0aba9f77f72fb
[Mon Feb 27 21:49:45 2023] RDX: 000000000000207f RSI: 15e8f459b38151ec RDI: 00000001764510b8
[Mon Feb 27 21:49:45 2023] RBP: 00007fff7a9354c0 R08: 9754c2e1e82aa2ed R09: 71e0aba980000000
[Mon Feb 27 21:49:45 2023] R10: 0000000176451040 R11: 000000017ab4ab40 R12: 00005647d74de0d0
[Mon Feb 27 21:49:45 2023] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[Mon Feb 27 21:49:45 2023]  </TASK>
[Mon Feb 27 21:49:45 2023] ---[ end trace 0000000000000000 ]---

dmesg_dump_27022023.log

From the limited information I have, the service should be stopped gracefully (signal 15, SIGTERM).
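For reference, a graceful stop on signal 15 usually means the process catches SIGTERM, leaves its main loop, and releases its resources itself before exiting. The following is only a minimal sketch of that pattern under stated assumptions, not dp-service's actual code: `install_sigterm_handler`, `poll_once`, `release_resources` and `run_service_loop` are hypothetical names, and the real device teardown is replaced by a placeholder. Whether such a teardown would avoid the warning above is not established here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Set by the signal handler, polled by the main loop. */
volatile sig_atomic_t stop_requested = 0;

/* Only sets a flag; nothing else is done in signal context. */
static void on_sigterm(int signum)
{
    (void)signum;
    stop_requested = 1;
}

/* Install the handler for SIGTERM (signal 15). Returns 0 on success. */
int install_sigterm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Hypothetical placeholder for one iteration of the service's work. */
void poll_once(void)
{
    struct timespec ts = { 0, 10 * 1000 * 1000 }; /* 10 ms */

    nanosleep(&ts, NULL); /* may return early when a signal arrives */
}

/* Hypothetical placeholder for explicit teardown (ports, memory regions, ...). */
void release_resources(void)
{
    puts("releasing resources before exit");
}

/* Run until SIGTERM is received, then tear down in a controlled order. */
int run_service_loop(void)
{
    if (install_sigterm_handler() != 0)
        return -1;

    while (!stop_requested)
        poll_once();

    /* The loop only ends once the handler has run. */
    assert(stop_requested == 1);
    release_resources();
    return 0;
}
```

Calling `run_service_loop()` from `main` and sending `kill -15 <pid>` would print the teardown message and return 0. If the process is killed before it reaches such a teardown (for example by a follow-up SIGKILL), cleanup is left to the kernel's exit path, which is where the call trace above originates (`do_exit` into `ib_uverbs_close`).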

Sometimes the process named in the warning is not dp_service but a DPDK telemetry or gRPC process; which one appears seems random.

PlagueCZ added the bug label Mar 1, 2023