Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Show TSAN stacktraces on Azure Linux #6870

Open
maxtropets opened this issue Feb 28, 2025 · 5 comments
Open

Show TSAN stacktraces on Azure Linux #6870

maxtropets opened this issue Feb 28, 2025 · 5 comments

Comments

@maxtropets
Copy link
Collaborator

At the moment looks like:

Image

Job: https://github.com/microsoft/CCF/actions/runs/13589076609/job/37990529965?pr=6616

@maxtropets
Copy link
Collaborator Author

I've reproduced it on local Azure Linux VM.

To begin with, I've checked that TSAN is able to load symbols in general on this machine.

I've tried a slightly modified example from clang docs a

#include <pthread.h>
int Global;
void *Thread1(void *x) {
  Global = 42;
  return x;
}
void bar(pthread_t &t) {
    pthread_create(&t, NULL, Thread1, NULL);
}
void foo(pthread_t &t) {
    bar(t);
}
int main() {
  pthread_t t;
  foo(t);
  Global = 43;
  pthread_join(t, NULL);
  return Global;
}

Compiled with clang -fsanitize=thread -g test-tsan.cpp and ran as ./a.out produced:

==================
WARNING: ThreadSanitizer: data race (pid=9092)
  Write of size 4 at 0x5555569f88a0 by main thread:
    #0 main /home/maxtropets/workspace/CCF/build/test-tsan.cpp:16:10 (a.out+0xf5840)

  Previous write of size 4 at 0x5555569f88a0 by thread T1:
    #0 Thread1(void*) /home/maxtropets/workspace/CCF/build/test-tsan.cpp:4:10 (a.out+0xf5768)

  Location is global 'Global' of size 4 at 0x5555569f88a0 (a.out+0x14a48a0)

  Thread T1 (tid=9094, finished) created by main thread at:
    #0 pthread_create <null> (a.out+0x42586)
    #1 bar(unsigned long&) /home/maxtropets/workspace/CCF/build/test-tsan.cpp:8:5 (a.out+0xf57c3)
    #2 foo(unsigned long&) /home/maxtropets/workspace/CCF/build/test-tsan.cpp:11:5 (a.out+0xf57f5)
    #3 main /home/maxtropets/workspace/CCF/build/test-tsan.cpp:15:3 (a.out+0xf5834)

SUMMARY: ThreadSanitizer: data race /home/maxtropets/workspace/CCF/build/test-tsan.cpp:16:10 in main
==================
ThreadSanitizer: reported 1 warnings

Now, trying to run ccf tests always produces smth like:

40:   Mutex M1 acquired here while holding mutex M0 in thread T2:
40:     #0 <null> <null> (cchost+0x5cb41)
40:     #1 <null> <null> (liblogging.virtual.so+0xcbb323)
40:     #2 <null> <null> (liblogging.virtual.so+0xdc2485)
40:     #3 <null> <null> (liblogging.virtual.so+0xdc2249)
40:     #4 <null> <null> (liblogging.virtual.so+0x1439a50)
40:     #5 <null> <null> (liblogging.virtual.so+0x14391d1)
40:     #6 <null> <null> (liblogging.virtual.so+0x14390e5)
40:     #7 <null> <null> (liblogging.virtual.so+0x1438f21)
40:     #8 <null> <null> (liblogging.virtual.so+0x1436414)
40:     #9 <null> <null> (liblogging.virtual.so+0x14362f0)
40:     #10 <null> <null> (liblogging.virtual.so+0x1436231)
40:     #11 <null> <null> (liblogging.virtual.so+0x1436145)
40:     #12 <null> <null> (liblogging.virtual.so+0x1435e61)
40:     #13 <null> <null> (liblogging.virtual.so+0xf2d7b4)
40:     #14 <null> <null> (liblogging.virtual.so+0xf2d5eb)
40:     #15 <null> <null> (liblogging.virtual.so+0xf2f977)
40:     #16 <null> <null> (liblogging.virtual.so+0xe597e8)
40:     #17 <null> <null> (liblogging.virtual.so+0x1018188)
40:     #18 <null> <null> (liblogging.virtual.so+0x121bbe6)
40:     #19 <null> <null> (liblogging.virtual.so+0x121a1fe)
40:     #20 <null> <null> (liblogging.virtual.so+0x10ecc38)
40:     #21 <null> <null> (liblogging.virtual.so+0x138f539)
40:     #22 <null> <null> (liblogging.virtual.so+0x138f26d)
40:     #23 <null> <null> (liblogging.virtual.so+0x138f165)
40:     #24 <null> <null> (liblogging.virtual.so+0x14cb6c3)
40:     #25 <null> <null> (liblogging.virtual.so+0x144542c)
40:     #26 <null> <null> (liblogging.virtual.so+0xed3ae4)
40:     #27 <null> <null> (liblogging.virtual.so+0xe5658d)
40:     #28 <null> <null> (cchost+0x49bf6d)
40:     #29 <null> <null> (cchost+0x49bc34)
40:     #30 <null> <null> (cchost+0x12f0e4)
40:     #31 <null> <null> (cchost+0x12f075)
40:     #32 <null> <null> (cchost+0x12efe5)
40:     #33 <null> <null> (cchost+0x12ef9d)
40:     #34 <null> <null> (cchost+0x12ef45)
40:     #35 <null> <null> (cchost+0x12ee09)
40:     #36 <null> <null> (libstdc++.so.6+0xec012) (BuildId: 4850484d65ef620145a5b9bf23f99d5cd446db13)

Setting ASAN_SYMBOLIZER_PATH didn't mitigate the issue.

Also, found this article about failing to parse symbols for dlclose-ed .so libraries, but removing dlclose call didn't help. Also it doesn't seem like the same issue, because the entire frame isn't there.

@eddyashton
Copy link
Member

I've now got an Azure Linux VM, and it looks like we do get stacktraces for failures in the unit tests:

$ ./ds_test 
[doctest] doctest version is "2.4.11"
[doctest] run with "--help" for options
^C==================
WARNING: ThreadSanitizer: signal-unsafe call inside of a signal (pid=12757)
    #0 operator new(unsigned long) <null> (ds_test+0x104e64)
    #1 std::__new_allocator<char>::allocate(unsigned long, void const*) /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.0/../../../../include/c++/13.2.0/bits/new_allocator.h:147:27 (ds_test+0x12693f)
    #2 std::allocator<char>::allocate(unsigned long) /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.0/../../../../include/c++/13.2.0/bits/allocator.h:198:32 (ds_test+0x12693f)
    #3 std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.0/../../../../include/c++/13.2.0/bits/alloc_traits.h:482:20 (ds_test+0x12693f)
    #4 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>::_S_allocate(std::allocator<char>&, unsigned long) /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.0/../../../../include/c++/13.2.0/bits/basic_string.h:126:16 (ds_test+0x12693f)
    #5 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>::_M_create(unsigned long&, unsigned long) /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.0/../../../../include/c++/13.2.0/bits/basic_string.tcc:155:14 (ds_test+0x12693f)
    #6 void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>::_M_construct<char const*>(char const*, char const*, std::forward_iterator_tag) /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.0/../../../../include/c++/13.2.0/bits/basic_string.tcc:225:14 (ds_test+0x12693f)
    #7 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>::basic_string<std::allocator<char>>(char const*, std::allocator<char> const&) /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/13.2.0/../../../../include/c++/13.2.0/bits/basic_string.h:639:2 (ds_test+0x12693f)
    #8 doctest::(anonymous namespace)::FatalConditionHandler::handleSignal(int) /home/edashton/CCF/3rdparty/test/doctest/doctest.h:4810:25 (ds_test+0x12693f)
    #9 __tsan::ProcessPendingSignalsImpl(__tsan::ThreadState*) <null> (ds_test+0x50604)
    #10 doctest::String::copy(doctest::String const&) /home/edashton/CCF/3rdparty/test/doctest/doctest.h (ds_test+0x10ea54)
    #11 doctest::String::operator=(doctest::String const&) /home/edashton/CCF/3rdparty/test/doctest/doctest.h:3656:9 (ds_test+0x10ea54)
    #12 doctest::detail::ResultBuilder::setResult(doctest::detail::Result const&) /home/edashton/CCF/3rdparty/test/doctest/doctest.h:4923:18 (ds_test+0x10ea54)
    #13 void test<unsigned long>(unsigned long, unsigned long) /home/edashton/CCF/src/ds/test/contiguous_set.cpp:31:5 (ds_test+0x22ae6f)
    #14 DOCTEST_ANON_FUNC_19() /home/edashton/CCF/src/ds/test/contiguous_set.cpp:551:3 (ds_test+0x20f4f6)
    #15 doctest::Context::run() /home/edashton/CCF/3rdparty/test/doctest/doctest.h:7007:21 (ds_test+0x11436d)
    #16 main /home/edashton/CCF/3rdparty/test/doctest/doctest.h:7085:71 (ds_test+0x115c5a)

SUMMARY: ThreadSanitizer: signal-unsafe call inside of a signal (/home/edashton/CCF/build.tsan/ds_test+0x104e64) in operator new(unsigned long)
==================

Note also that this failed without any local modifications, which is surprising - perhaps we need DOCTEST_CONFIG_NO_POSIX_SIGNALS?

Regardless, this is a sign that these stacks aren't omitted for everything we build. Either the dlopen, or some other discrepancy in how we build cchost and libraries vs unit tests, is causing the callstack to get null'd

@achamayou
Copy link
Member

@eddyashton my money is on dlopen flags, but let's see.

@eddyashton
Copy link
Member

@eddyashton my money is on dlopen flags, but let's see.

I suspect so too, further evidenced by the fact that adding a simple violation to the start of cchost's main produces a sensible stack:

$ ./cchost 
==================
WARNING: ThreadSanitizer: data race (pid=21257)
  Write of size 4 at 0x555556d59e08 by main thread:
    #0 do_a_crime() /home/edashton/CCF/src/host/main.cpp:62:10 (cchost+0x111c32)
    #1 main /home/edashton/CCF/src/host/main.cpp:100:3 (cchost+0x111c32)

  Previous write of size 4 at 0x555556d59e08 by thread T1:
    #0 Thread1(void*) /home/edashton/CCF/src/host/main.cpp:50:10 (cchost+0x111879)

  Location is global 'Global' of size 4 at 0x555556d59e08 (cchost+0x1805e08)

  Thread T1 (tid=21259, finished) created by main thread at:
    #0 pthread_create <null> (cchost+0x57c36)
    #1 bar(unsigned long&) /home/edashton/CCF/src/host/main.cpp:54:5 (cchost+0x111c26)
    #2 foo(unsigned long&) /home/edashton/CCF/src/host/main.cpp:57:5 (cchost+0x111c26)
    #3 do_a_crime() /home/edashton/CCF/src/host/main.cpp:61:3 (cchost+0x111c26)
    #4 main /home/edashton/CCF/src/host/main.cpp:100:3 (cchost+0x111c26)

SUMMARY: ThreadSanitizer: data race /home/edashton/CCF/src/host/main.cpp:62:10 in do_a_crime()
==================
--config is required
Run with --help for more information.
ThreadSanitizer: reported 1 warnings

But the obvious culprit in the dlopen flags is insufficient - I still get nulls for (in-library) violations with:

-#if defined(__has_feature)
-#  if __has_feature(address_sanitizer)
+// #if defined(__has_feature)
+// #  if __has_feature(address_sanitizer)
         // Avoid unloading on delete under ASAN, so that leak checking can still
         // access symbols
         | RTLD_NODELETE
-#  endif
-#endif
+// #  endif
+// #endif

@maxtropets
Copy link
Collaborator Author

Tried linking against built-from-source openssl to make sure symcrypt is not guilty. Still nulls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants