Skip to content
lundman edited this page Oct 25, 2013 · 7 revisions

== kernel debug ==

Dealing with panics.

Boot target VM with "-v keepsyms=y debug=0x144"

Make it panic.

On your dev machine, you will need the kernelDebugKit from apple. Download it.

gdb /Volumes/Kernelit/mach_kernel
source /Volumes/KernelDebugKit/kgmacros
target remote-kdp

kdp-reattach  192.168.30.133   # obviously use the IP of your target / crashed VM

showallkmods
(find "address" for zfs and spl modules)
^Z   # suspend gdb, or, use another terminal

kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel -e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/

fg # resume gdb, or go back to gdb terminal
set kext-symbol-file-path /tmp

add-kext /tmp/spl.kext
add-kext /tmp/zfs.kext

bt

=== Non-panic ===

You can always panic a kernel with

sudo dtrace -w -n "BEGIN{ panic();}"

If you prefer to work in gdb.

But this was revealing;

#  /usr/libexec/stackshot -i -f /tmp/stackshot.log 
# symstacks.rb -f /tmp/stackshot.log -s -w /tmp/trace.txt
# less /tmp/trace.txt

Note that my hang is here;

PID: 156
    Process: zpool
    Thread ID: 0x4e2
    Thread state: 0x9 == TH_WAIT |TH_UNINT 
    Thread wait_event: 0xffffff8006608a6c
    Kernel stack: 
    machine_switch_context (in mach_kernel) + 366 (0xffffff80002b3d3e)
      0xffffff800022e711 (in mach_kernel) + 1281 (0xffffff800022e711)
        thread_block_reason (in mach_kernel) + 300 (0xffffff800022d9dc)
          lck_mtx_sleep (in mach_kernel) + 78 (0xffffff80002265ce)
            0xffffff8000569ef6 (in mach_kernel) + 246 (0xffffff8000569ef6)
              msleep (in mach_kernel) + 116 (0xffffff800056a2e4)
                0xffffff7f80e52a76 (0xffffff7f80e52a76)
                  0xffffff7f80e53fae (0xffffff7f80e53fae)
                    0xffffff7f80e54173 (0xffffff7f80e54173)
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)
                        0xffffff7f80f2bb4e (0xffffff7f80f2bb4e)
                          0xffffff7f80f1a9b7 (0xffffff7f80f1a9b7)
                            0xffffff7f80f1b65f (0xffffff7f80f1b65f)
                              0xffffff7f80f042ee (0xffffff7f80f042ee)
                                0xffffff7f80f45c5b (0xffffff7f80f45c5b)
                                  0xffffff7f80f4ce92 (0xffffff7f80f4ce92)
                                    spec_ioctl (in mach_kernel) + 157 (0xffffff8000320bfd)
                                      VNOP_IOCTL (in mach_kernel) + 244 (0xffffff8000311e84)

It is a little shame that it only shows the kernel symbols, and not inside SPL and ZFS. But we can ask it to load another sym file (alas, it can not handle multiple symbols files, fix this Apple)

# kextstat
# Grab the addresses of spl and zfs again
# kextutil -s /tmp -n -k /Volumes/KernelDebugKit/mach_kernel -e -r /Volumes/KernelDebugKit module/zfs/zfs.kext/ ../spl/module/spl/spl.kext/ 

# symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.spl.sym
              0xffffff800056a2e4 (0xffffff800056a2e4)
                spl_cv_wait (in net.lundman.spl.sym) + 54 (0xffffff7f80e52a76)
                  taskq_wait (in net.lundman.spl.sym) + 78 (0xffffff7f80e53fae)
                    taskq_destroy (in net.lundman.spl.sym) + 35 (0xffffff7f80e54173)
                      0xffffff7f80f1a870 (0xffffff7f80f1a870)

# symstacks.rb -f /tmp/stackshot.log -s -k /tmp/net.lundman.zfs.sym
                    0xffffff7f80e54173 (0xffffff7f80e54173)
                      vdev_open_children (in net.lundman.zfs.sym) + 336 (0xffffff7f80f1a870)
                        vdev_root_open (in net.lundman.zfs.sym) + 94 (0xffffff7f80f2bb4e)
                          vdev_open (in net.lundman.zfs.sym) + 311 (0xffffff7f80f1a9b7)
                            vdev_create (in net.lundman.zfs.sym) + 31 (0xffffff7f80f1b65f)
                              spa_create (in net.lundman.zfs.sym) + 878 (0xffffff7f80f042ee)

Voila!

=== Memory Leaks ===

If you suspect memory issues, for example from a panic like:

panic(cpu 1 caller 0xffffff80002438d8): "zalloc: \"kalloc.1024\" (100535 elements) retry fail 3, kfree_nop_count: 0"@/SourceCache/xnu/xnu-2050.7.9/osfmk/kern/zalloc.c:1826

Attach gdb and use thezprint command;

(gdb) zprint
ZONE                   COUNT   TOT_SZ   MAX_SZ   ELT_SZ ALLOC_SZ         TOT_ALLOC         TOT_FREE NAME
0xffffff8002a89250   1620133  18c1000  22a3599       16     1000         125203838        123583705 kalloc.16 CX
0xffffff8006306c50    110335   35f000   4ce300       32     1000          13634985         13524650 kalloc.32 CX
0xffffff8006306a00    133584   82a000   e6a900       64     1000          26510120         26376536 kalloc.64 CX
0xffffff80063067b0    610090  4a84000  614f4c0      128     1000          50524515         49914425 kalloc.128 CX
0xffffff8006306560   1070398 121a2000 1b5e4d60      256     1000          72534632         71464234 kalloc.256 CX
0xffffff8006306310    399302  d423000  daf26b0      512     1000          39231204         38831902 kalloc.512 CX
0xffffff80063060c0    100404  6231000  c29e980     1024     1000          22949693         22849289 kalloc.1024 CX
0xffffff8006305e70       292    9a000   200000     2048     1000          77633725         77633433 kalloc.2048 CX

In this case, kalloc.256 is suspect.

Reboot kernel with zlog=kalloc.256 on the command line, then we can use

(gdb) findoldest                                                                
oldest record is at log index 393:

--------------- ALLOC  0xffffff803276ec00 : index 393  :  ztime 21643824 -------------
0xffffff800024352e <zalloc_canblock+78>:        mov    %eax,-0xcc(%rbp)
0xffffff80002245bd <get_zone_search+23>:        jmpq   0xffffff80002246d8 <KALLOC_ZINFO_SALLOC+35>
0xffffff8000224c39 <OSMalloc+89>:       mov    %rax,-0x18(%rbp)
0xffffff7f80e847df <zfs_kmem_alloc+15>: mov    %rax,%r15
0xffffff7f80e90649 <arc_buf_alloc+41>:  mov    %rax,-0x28(%rbp)

and indeed, list any index

(gdb) zstack 394

--------------- ALLOC  0xffffff8032d60700 : index 394  :  ztime 21648810 -------------
0xffffff800024352e <zalloc_canblock+78>:        mov    %eax,-0xcc(%rbp)
0xffffff80002245bd <get_zone_search+23>:        jmpq   0xffffff80002246d8 <KALLOC_ZINFO_SALLOC+35>
0xffffff8000224c39 <OSMalloc+89>:       mov    %rax,-0x18(%rbp)
0xffffff7f80e847df <zfs_kmem_alloc+15>: mov    %rax,%r15
0xffffff7f80e90649 <arc_buf_alloc+41>:  mov    %rax,-0x28(%rbp)

How many times was zfs_kmem_alloc involved in the leaked allocs?

(gdb) countpcs 0xffffff7f80e847df
occurred 3999 times in log (100% of records)

At least we know it is our fault.

How many times is it arc_buf_alloc?

(gdb) countpcs 0xffffff7f80e90649
occurred 2390 times in log (59% of records)

== lldb kernel debug ==

 echo "settings set target.load-script-from-symbol-file true" >> ~/.lldbinit
 lldb /Volumes/KernelDebugKit/mach_kernel
 kdp-remote  192.168.30.146
 showallkmods
 addkext -F /tmp/spl.kext/Contents/MacOS/spl 0xffffff7f8ebb0000   (Address from showallkmods)
 addkext -F /tmp/zfs.kext/Contents/MacOS/zfs 0xffffff7f8ebbf000

 Then follow guide for GDB above

Clone this wiki locally