debugger
[TOC]
I don't claim to be an expert with command-line debuggers (gdb, lldb), but they are useful for finding where code hits exceptions or segfaults. debug-exception.cc is a simple test program with functions foo1, foo2, foo3, foo4. Passing in 1 throws in foo1, 2 throws in foo2, etc. Just for fun it uses OpenMP, so there are multiple threads too, but that doesn't really change anything. This example uses lldb on macOS; gdb has analogous functionality but different syntax. Lines starting with ### are comments I added.
You may have to play around with the syntax. I figured this out from https://stackoverflow.com/questions/8122375/lldb-breakpoint-on-exceptions-equivalent-of-gdbs-catch-throw but there were several different syntaxes given that didn't work for me – maybe for different versions of lldb?
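For readers without the file at hand, here is a rough sketch of what a test like debug-exception.cc could look like. It is a guess based only on the description above and the output below, not the actual source: the exception type, the messages, and the thread_id helper are assumptions, and the real program evidently calls foo3 more than once per thread.

```cpp
// Hypothetical sketch of a test like debug-exception.cc (not the real file).
#include <cstdio>
#include <stdexcept>

#ifdef _OPENMP
#include <omp.h>
#endif

// Thread id, or 0 when compiled without OpenMP (assumed helper).
static int thread_id()
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Each fooN throws if `where` equals N; otherwise it calls the next level down.
void foo1( int where, int tid )
{
    printf( "foo1( %d, tid %d )\n", where, tid );
    if (where == 1)
        throw std::runtime_error( "foo1" );
}

void foo2( int where, int tid )
{
    printf( "foo2( %d, tid %d )\n", where, tid );
    if (where == 2)
        throw std::runtime_error( "foo2" );
    foo1( where, tid );
}

void foo3( int where, int tid )
{
    printf( "foo3( %d, tid %d )\n", where, tid );
    if (where == 3)
        throw std::runtime_error( "foo3" );
    foo2( where, tid );
}

// foo4 runs foo3 on every OpenMP thread. Nothing catches an exception thrown
// inside the parallel region, so the program terminates.
void foo4( int where )
{
    printf( "foo4( %d )\n", where );
    if (where == 4)
        throw std::runtime_error( "foo4" );
    #pragma omp parallel
    {
        foo3( where, thread_id() );
    }
}
```

A main that prints its argument and calls foo4 with it would complete the program. Every throw goes through __cxa_throw, which is why a single breakpoint on that symbol catches all of them regardless of which function or thread throws.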
That said, I also agree that we should be throwing exceptions that have more useful information in them. That CUDA code may have originated before we added slate_cuda_call, but it should be updated.
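As an illustration of what a more informative exception could carry, here is a minimal sketch of a call-checking wrapper. It is not SLATE's slate_cuda_call; the names checked_call, throw_if_error, and fake_device_call are hypothetical, and it assumes the wrapped routine returns an integer error code with 0 meaning success.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: throws with a message saying what failed and where.
inline void throw_if_error(
    int err, const char* call, const char* func, const char* file, int line )
{
    if (err != 0) {
        throw std::runtime_error(
            std::string( call ) + " failed with error " + std::to_string( err )
            + " in " + func + " at " + file + ":" + std::to_string( line ) );
    }
}

// Hypothetical macro: records the call text, function, file, and line
// at the call site, then checks the returned error code.
#define checked_call( call ) \
    throw_if_error( (call), #call, __func__, __FILE__, __LINE__ )

// Stand-in for a library routine that returns an error code (0 = success).
inline int fake_device_call( int err )
{
    return err;
}
```

With a message like this, the what() string printed at termination already names the failing call and its location, so the debugger is only needed when the call stack itself matters.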
test/c++> make debug-exception
g++ -Wall -pedantic -std=c++11 -fopenmp -c -o debug-exception.o debug-exception.cc
g++ -fopenmp -o debug-exception debug-exception.o
### Run ./debug-exception 3, which throws in foo3.
thyme test/c++> ./debug-exception 3
main( 3 )
foo4( 3 )
foo3( 3, tid 0 )
foo3( 3, tid 1 )
foo3( 3, tid 2 )
terminate called recursively
terminate called recursively
foo3( 3, tid 3 )
Abort
### Now run it in the debugger.
thyme test/c++> lldb ./debug-exception
(lldb) target create "./debug-exception"
Current executable set to '/Users/mgates/Documents/test/c++/debug-exception' (x86_64).
### Set breakpoint on throwing C++ exceptions.
(lldb) break set -n __cxa_throw
Breakpoint 1: 2 locations.
### Run ./debug-exception 0, which doesn't throw an exception.
(lldb) run 0
Process 91619 launched: '/Users/mgates/Documents/test/c++/debug-exception' (x86_64)
main( 0 )
foo4( 0 )
foo3( 0, tid 1 )
foo2( 0, tid 1 )
foo1( 0, tid 1 )
foo3( 0, tid 1 )
foo2( 0, tid 1 )
foo1( 0, tid 1 )
foo3( 0, tid 0 )
foo2( 0, tid 0 )
foo1( 0, tid 0 )
foo3( 0, tid 0 )
foo3( 0, tid 3 )
foo3( 0, tid 2 )
foo2( 0, tid 2 )
foo1( 0, tid 2 )
foo2( 0, tid 3 )
foo1( 0, tid 3 )
foo3( 0, tid 3 )
foo2( 0, tid 3 )
foo3( 0, tid 2 )
foo2( 0, tid 0 )
foo2( 0, tid 2 )
foo1( 0, tid 3 )
foo1( 0, tid 2 )
foo3( 0, tid 1 )
foo2( 0, tid 1 )
foo1( 0, tid 1 )
foo1( 0, tid 0 )
foo3( 0, tid 0 )
foo2( 0, tid 0 )
foo1( 0, tid 0 )
Process 91619 exited with status = 0 (0x00000000)
### Run ./debug-exception 2, which throws an exception in foo2.
(lldb) run 2
Process 91625 launched: '/Users/mgates/Documents/test/c++/debug-exception' (x86_64)
main( 2 )
foo4( 2 )
foo3( 2, tid 0 )
foo2( 2, tid 0 )
foo3( 2, tid 2 )
foo2( 2, tid 2 )
foo3( 2, tid 1 )
Process 91625 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x0000000100122260 libstdc++.6.dylib`__cxa_throw
libstdc++.6.dylib`__cxa_throw:
-> 0x100122260 <+0>: pushq %r13
0x100122262 <+2>: movq %rdx, %r13
0x100122265 <+5>: pushq %r12
0x100122267 <+7>: movq %rsi, %r12
thread #3, stop reason = breakpoint 1.1
frame #0: 0x0000000100122260 libstdc++.6.dylib`__cxa_throw
libstdc++.6.dylib`__cxa_throw:
-> 0x100122260 <+0>: pushq %r13
0x100122262 <+2>: movq %rdx, %r13
0x100122265 <+5>: pushq %r12
0x100122267 <+7>: movq %rsi, %r12
Target 0: (debug-exception) stopped.
### Looking at the backtrace (bt), we see it was in foo2.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
* frame #0: 0x0000000100122260 libstdc++.6.dylib`__cxa_throw
frame #1: 0x0000000100003ac2 debug-exception`foo2(int, int) + 104
frame #2: 0x0000000100003b4f debug-exception`foo3(int, int) + 119
frame #3: 0x0000000100003d0a debug-exception`foo4(int) (._omp_fn.0) + 100
frame #4: 0x0000000100452bd2 libgomp.1.dylib`GOMP_parallel + 66
frame #5: 0x0000000100003bd9 debug-exception`foo4(int) + 131
frame #6: 0x0000000100003c3a debug-exception`main + 90
frame #7: 0x00007fff6ce1bcc9 libdyld.dylib`start + 1
### Run ./debug-exception 3, which throws an exception in foo3.
(lldb) kill
Process 91625 exited with status = 9 (0x00000009)
(lldb) run 3
Process 91633 launched: '/Users/mgates/Documents/test/c++/debug-exception' (x86_64)
main( 3 )
foo4( 3 )
foo3( 3, tid 0 )
foo3( 3, tid 1 )
foo3( 3, tid 2 )
Process 91633 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x0000000100122260 libstdc++.6.dylib`__cxa_throw
libstdc++.6.dylib`__cxa_throw:
-> 0x100122260 <+0>: pushq %r13
0x100122262 <+2>: movq %rdx, %r13
0x100122265 <+5>: pushq %r12
0x100122267 <+7>: movq %rsi, %r12
thread #2, stop reason = breakpoint 1.1
frame #0: 0x0000000100122260 libstdc++.6.dylib`__cxa_throw
libstdc++.6.dylib`__cxa_throw:
-> 0x100122260 <+0>: pushq %r13
0x100122262 <+2>: movq %rdx, %r13
0x100122265 <+5>: pushq %r12
0x100122267 <+7>: movq %rsi, %r12
Target 0: (debug-exception) stopped.
### Looking at the backtrace (bt), we see it was in foo3.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
* frame #0: 0x0000000100122260 libstdc++.6.dylib`__cxa_throw
frame #1: 0x0000000100003b40 debug-exception`foo3(int, int) + 104
frame #2: 0x0000000100003d0a debug-exception`foo4(int) (._omp_fn.0) + 100
frame #3: 0x0000000100452bd2 libgomp.1.dylib`GOMP_parallel + 66
frame #4: 0x0000000100003bd9 debug-exception`foo4(int) + 131
frame #5: 0x0000000100003c3a debug-exception`main + 90
frame #6: 0x00007fff6ce1bcc9 libdyld.dylib`start + 1
(lldb) kill
Process 91633 exited with status = 9 (0x00000009)
(lldb) ^D
Debugging SLATE's tester requires compiling with the -g flag and, at least for test/test.o, with -O0.
Here's my make.inc file on leconte showing the additions:
slate> cat make.inc
CXX = mpicxx
FC = mpif90
#################### Added these lines ####################
CXXFLAGS = -g -Wno-unused-variable
# This is in SLATE's GNUmakefile since https://bitbucket.org/icl/slate-dev/pull-requests/137
# For `tester --debug` purposes, compile test.o with -O0 (after -O3).
test/test.o: CXXFLAGS += -O0
#################### ####################
# BLAS can be mkl or openblas (or others on other systems). Choose one.
blas = mkl
#blas = openblas
# Intel MKL supports gfortran conventions and ifort conventions.
# Choose one to match mpif90 compiler.
blas_fortran = gfortran
#blas_fortran = ifort
# Intel MKL supports Open MPI and Intel MPI.
# Choose one to match MPI library.
#mkl_blacs = openmpi
mkl_blacs = intelmpi
cuda_arch = volta
gpu_backend = cuda
For instance, when I compile test.o, the command is:
mpicxx -g -Wno-unused-variable -O3 -std=c++17 \
-Wall -Wshadow -pedantic -MMD -fPIC -fopenmp \
-DSLATE_WITH_MKL -DSLATE_NO_HIP \
-I./blaspp/include -I./lapackpp/include -I./include -I./src \
-O0 -I./testsweeper -c test/test.cc -o test/test.o
where the later -O0 overrides the earlier -O3.
Here's an example that is failing. I added Tile A00 = A( 0, 0 ); in src/internal/internal_gemm.cc, which fails on ranks where A( 0, 0 ) doesn't exist.
slate/test> mpirun -np 4 ./tester gemm
SLATE version 2022.05.00, id 483bde4a
input: ./tester gemm
2022-06-02 14:38:26, MPI size 4, OpenMP threads 20, GPU devices available 8
type origin target m ... error time (s) ... status
d host task 100 ... 3.11e-16 0.000441 ... pass
d host task 200 ... 4.60e-16 0.00149 ... pass
d host task 300 ... 2.92e-16 0.00304 ... pass
terminate called after throwing an instance of 'std::out_of_range'
what(): map::at
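The what(): map::at message comes from std::map::at, which throws std::out_of_range when the key is absent. The sketch below reproduces that behavior in isolation. TileMap, get_tile, and tile_exists are hypothetical stand-ins for illustration, not SLATE's storage classes.

```cpp
#include <map>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for per-rank tile storage: only tiles present on
// this rank have an entry, keyed by (i, j).
using TileMap = std::map< std::pair<int, int>, double >;

// Mirrors the failing access: std::map::at throws std::out_of_range
// when tile (i, j) has no entry.
inline double get_tile( TileMap const& tiles, int i, int j )
{
    return tiles.at( std::make_pair( i, j ) );
}

// Checks for the tile without throwing.
inline bool tile_exists( TileMap const& tiles, int i, int j )
{
    return tiles.count( std::make_pair( i, j ) ) > 0;
}
```

The exception message alone says nothing about which tile or which caller was involved, which is why the backtrace from the debugger is needed to locate the access.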
Adding the --debug R flag to the tester will cause rank R to wait for a debugger to attach (here, R = 1).
slate/test> mpirun -np 4 ./tester --debug 1 gemm
MPI rank 1, pid 71503 on leconte.icl.utk.edu ready for debugger (gdb/lldb) to attach.
After attaching, step out to run() and set i=1, e.g.:
lldb -p 71503
(lldb) break set -n __cxa_throw # break on C++ exception
(lldb) thread step-out # repeat
(lldb) expr i=1
(lldb) continue
Rank 1 waits here for a debugger to attach. Once a debugger attaches and continues execution (see below), the tester will keep going.
SLATE version 2022.05.00, id 483bde4a
input: ./tester --debug 1 gemm
2022-06-02 14:41:31, MPI size 4, OpenMP threads 20, GPU devices available 8
type origin target m ... error time (s) ... status
d host task 100 ... 3.14e-16 0.000483 ... pass
d host task 200 ... 4.48e-16 0.00147 ... pass
d host task 300 ... 2.83e-16 0.00319 ... pass
terminate called after throwing an instance of 'std::out_of_range'
what(): map::at
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 71502 RUNNING AT leconte.icl.utk.edu
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
Run the lldb or gdb debugger in a separate terminal, attaching to the tester process per the instructions that SLATE's tester printed (above).
> lldb -p 71503
Process 71503 stopped
* thread #1, name = 'tester', stop reason = signal SIGSTOP
frame #0: 0x00007f340f8c89fd libc.so.6`__nanosleep + 45
libc.so.6`__nanosleep:
-> 0x7f340f8c89fd <+45>: movq (%rsp), %rdi
0x7f340f8c8a01 <+49>: movq %rax, %rdx
0x7f340f8c8a04 <+52>: callq 0x7f340f90f890 ; __libc_disable_asynccancel
0x7f340f8c8a09 <+57>: movq %rdx, %rax
thread #2, name = 'cuda-EvtHandlr', stop reason = signal SIGSTOP
frame #0: 0x00007f340f8f6ddd libc.so.6`poll + 45
libc.so.6`poll:
-> 0x7f340f8f6ddd <+45>: movq (%rsp), %rdi
0x7f340f8f6de1 <+49>: movq %rax, %rdx
0x7f340f8f6de4 <+52>: callq 0x7f340f90f890 ; __libc_disable_asynccancel
0x7f340f8f6de9 <+57>: movq %rdx, %rax
Break on C++ exceptions:
(lldb) break set -n __cxa_throw
Breakpoint 2: where = libstdc++.so.6`__cxxabiv1::__cxa_throw(void *, std::type_info *, void (*)(void *)) at eh_throw.cc:77:1, address = 0x00007f34103cdff0
It's helpful to run a backtrace and disassembly immediately whenever the debugger stops; here's a stop hook from an lldb cheat sheet. I've found that other MPI ranks sometimes cause the whole program to abort before there is time to run debugger commands manually.
(lldb) target stop-hook add
Enter your stop hook command(s). Type 'DONE' to end.
> bt
> disassemble --pc
DONE
Stop hook #1 added.
Initially the debugger will probably be stopped in some system sleep routine (__nanosleep). Use thread step-out a few times until it shows the SLATE tester source code with while (0 == i).
(lldb) thread step-out
* thread #1, name = 'tester', stop reason = step out
* frame #0: 0x00007f340f8c8894 libc.so.6`sleep + 212
frame #1: 0x000000000045d2c7 tester`run(argc=4, argv=0x00007ffe1f6d35d8) at test.cc:651:22
frame #2: 0x000000000045db42 tester`main(argc=4, argv=0x00007ffe1f6d35d8) at test.cc:764:21
frame #3: 0x00007f340f825555 libc.so.6`__libc_start_main + 245
frame #4: 0x0000000000459677 tester`_start + 41
libc.so.6`sleep:
-> 0x7f340f8c8894 <+212>: movl %eax, %ebx
0x7f340f8c8896 <+214>: testl %ebx, %ebx
0x7f340f8c8898 <+216>: je 0x7f340f8c88c0 ; <+256>
0x7f340f8c889a <+218>: xorl %ebp, %ebp
Process 71503 stopped
* thread #1, name = 'tester', stop reason = step out
frame #0: 0x00007f340f8c8894 libc.so.6`sleep + 212
libc.so.6`sleep:
-> 0x7f340f8c8894 <+212>: movl %eax, %ebx
0x7f340f8c8896 <+214>: testl %ebx, %ebx
0x7f340f8c8898 <+216>: je 0x7f340f8c88c0 ; <+256>
0x7f340f8c889a <+218>: xorl %ebp, %ebp
(lldb) thread step-out
* thread #1, name = 'tester', stop reason = step out
* frame #0: 0x000000000045d2c7 tester`run(argc=4, argv=0x00007ffe1f6d35d8) at test.cc:650:13
frame #1: 0x000000000045db42 tester`main(argc=4, argv=0x00007ffe1f6d35d8) at test.cc:764:21
frame #2: 0x00007f340f825555 libc.so.6`__libc_start_main + 245
frame #3: 0x0000000000459677 tester`_start + 41
tester`run:
-> 0x45d2c7 <+2711>: jmp 0x45d2ae ; <+2686> at test.cc:650:22
0x45d2c9 <+2713>: movl $0x44000000, %edi ; imm = 0x44000000
0x45d2ce <+2718>: callq 0x435230 ; symbol stub for: MPI_Barrier
0x45d2d3 <+2723>: movl %eax, -0x64(%rbp)
Process 71503 stopped
* thread #1, name = 'tester', stop reason = step out
frame #0: 0x000000000045d2c7 tester`run(argc=4, argv=0x00007ffe1f6d35d8) at test.cc:650:13
647 "(lldb) continue\n",
648 mpi_rank, getpid(), hostname, getpid() );
649 fflush( stdout );
-> 650 while (0 == i)
651 sleep(1);
652 }
653 slate_mpi_call( MPI_Barrier( MPI_COMM_WORLD ) );
Running expr i=1 sets i to 1, which breaks out of that while loop. If the debugger doesn't know the variable i, check that you compiled with -g and -O0.
(lldb) expr i=1
(volatile int) $1 = 1
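For context, the wait loop that expr i=1 releases can be sketched as follows. This is modeled on the few lines of test.cc shown in the listing above, not copied from SLATE; wait_for_debugger and its max_seconds limit are hypothetical additions so that the sketch cannot hang forever.

```cpp
#include <cstdio>
#include <unistd.h>  // getpid, sleep (POSIX)

// Hypothetical helper: announce the pid, then spin until a debugger sets i
// to nonzero or max_seconds elapse. Returns true if i was set.
// i is volatile so the compiler re-reads it on every iteration; otherwise an
// optimized build could hoist the load and never see the debugger's write.
bool wait_for_debugger( volatile int& i, int max_seconds )
{
    printf( "pid %d ready for debugger (gdb/lldb) to attach.\n",
            (int) getpid() );
    fflush( stdout );

    int waited = 0;
    while (0 == i && waited < max_seconds) {
        sleep( 1 );
        ++waited;
    }
    return i != 0;
}
```

The debugger output above reports i as (volatile int), which is consistent with this pattern. Attaching while the process sleeps lands in the system sleep routine, which is why a few thread step-out commands are needed before i is in scope.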
Continue running until a C++ exception or breakpoint occurs, or the program completes. Here it stopped at a C++ exception, which the backtrace shows, in frame #5, occurred in slate::internal::gemm at internal_gemm.cc line 76, which is indeed where the error was injected.
(lldb) continue
Process 71503 resuming
thread #11, name = 'tester', stop reason = breakpoint 1.1 2.1
frame #0: 0x00007f34103cdff0 libstdc++.so.6`__cxxabiv1::__cxa_throw(obj=0x00007f3198000960, tinfo=0x00007f34106e9228, dest=(libstdc++.so.6`std::out_of_range::~out_of_range() at stdexcept.cc:65:33))(void *)) at eh_throw.cc:77:1
frame #1: 0x00007f34103c5352 libstdc++.so.6`std::__throw_out_of_range(__s="map::at") at functexcept.cc:82:5
frame #2: 0x00000000004b3c4c tester`slate::BaseMatrix<double>::operator()(long, long, int) at stl_map.h:541:24
frame #3: 0x00000000004b3c40 tester`slate::BaseMatrix<double>::operator()(long, long, int) at MatrixStorage.hh:388
frame #4: 0x00000000004b3c40 tester`slate::BaseMatrix<double>::operator(this=0x00007f330cb6edc0, i=0, j=0, device=-1)(long, long, int) at BaseMatrix.hh:1236
frame #5: 0x00007f34239b696e libslate.so`void slate::internal::gemm<double>((null)=TargetType<(slate::Target)84> @ 0x00007f330cb6ece0, alpha=3.1415926535897931, A=0x00007f330cb6edc0, B=0x00007f330cb6ed40, beta=2.7182818284590451, C=0x00007ffe1f6cde40, layout=ColMajor, priority=0, queue_index=<unavailable>, opts=error: summary string parsing error)84>, double, slate::Matrix<double>&, slate::Matrix<double>&, double, slate::Matrix<double>&, blas::Layout, int, long, std::map<slate::Option, slate::OptionValue, std::less<slate::Option>, std::allocator<std::pair<slate::Option const, slate::OptionValue> > > const&) at internal_gemm.cc:76:13
frame #6: 0x00007f34239b6f77 libslate.so`void slate::internal::gemm<(slate::Target)84, double>(alpha=<unavailable>, A=<unavailable>, B=<unavailable>, beta=<unavailable>, C=<unavailable>, layout=<unavailable>, priority=<unavailable>, queue_index=<unavailable>, opts=error: summary string parsing error) at internal_gemm.cc:52:9
frame #7: 0x00007f3423d4f3ad libslate.so`_ZN5slate5gemmCILNS_6TargetE84EdEEvT0_RNS_6MatrixIS2_EES5_S2_S5_RKSt3mapINS_6OptionENS_11OptionValueESt4lessIS7_ESaISt4pairIKS7_S8_EEE._omp_fn.4((null)=0x00007f330cb6edc0) at gemmC.cc:106:35
frame #8: 0x00007f340fdfe1f4 libgomp.so.1`gomp_barrier_handle_tasks(state=320) at task.c:1387:6
frame #9: 0x00007f340fe05818 libgomp.so.1`gomp_team_barrier_wait_end(bar=<unavailable>, state=320) at bar.c:116:4
frame #10: 0x00007f340fe02e32 libgomp.so.1`gomp_thread_start(xdata=<unavailable>) at team.c:124:4
frame #11: 0x00007f341bab7ea5 libpthread.so.0`start_thread + 197
frame #12: 0x00007f340f901b0d libc.so.6`__clone + 109
libstdc++.so.6`__cxxabiv1::__cxa_throw(void *, std::type_info *, void (*)(void *)):
-> 0x7f34103cdff0 <+0>: pushq %r13
0x7f34103cdff2 <+2>: movq %rdx, %r13
0x7f34103cdff5 <+5>: pushq %r12
0x7f34103cdff7 <+7>: movq %rsi, %r12
Process 71503 stopped
* thread #11, name = 'tester', stop reason = breakpoint 1.1 2.1
frame #0: 0x00007f34103cdff0 libstdc++.so.6`__cxxabiv1::__cxa_throw(obj=0x00007f3198000960, tinfo=0x00007f34106e9228, dest=(libstdc++.so.6`std::out_of_range::~out_of_range() at stdexcept.cc:65:33))(void *)) at eh_throw.cc:77:1
Process 71503 exited with status = -1 (0xffffffff) debugserver died with an exit status of 0x00000000
Initial commands can be put into an init.lldb file:
break set -n __cxa_throw
target stop-hook add
bt
disassemble --pc
DONE
that is sourced when running lldb:
lldb -s init.lldb -p 71503