Thread-safe access to rmw_node_data_t #259

Merged: 7 commits into rolling on Sep 30, 2024
Conversation

@Yadunund (Member) commented on Aug 6, 2024:

This PR builds off #258.

  • Introduces a NodeData class that stores information about an rmw_node_t (RMW node). In subsequent PRs this class will be extended to store unordered_maps of PublisherData, SubscriptionData, etc.
  • rmw_context_impl_s gains APIs to create, retrieve, and delete the NodeData objects created for the various rmw_node_t instances (see the sketch after this list).
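
For orientation, a rough, hypothetical sketch of the shape this gives the context. Only create_node_data is confirmed by the stack trace later in this thread; get_node_data, delete_node_data, the signatures, and the locking details here are assumptions for illustration, not the PR's exact code:

#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

#include <rmw/rmw.h>  // for rmw_node_t

class NodeData
{
public:
  // Per-node bookkeeping: the node's liveliness token and, in later PRs,
  // unordered_maps of PublisherData / SubscriptionData, etc.
};

// Hypothetical stand-in for the real rmw_context_impl_s.
struct rmw_context_impl_s_sketch
{
  // Create and store NodeData for a node; false if one already exists.
  bool create_node_data(
    const rmw_node_t * node, const std::string & ns, const std::string & node_name)
  {
    std::lock_guard<std::recursive_mutex> lock(mutex_);
    (void)ns; (void)node_name;  // used to build the liveliness keyexpr in the real code
    if (nodes_.count(node) > 0) {
      return false;
    }
    nodes_[node] = std::make_shared<NodeData>();
    return true;
  }

  // Look up previously created NodeData; nullptr if the node is unknown.
  std::shared_ptr<NodeData> get_node_data(const rmw_node_t * node)
  {
    std::lock_guard<std::recursive_mutex> lock(mutex_);
    auto it = nodes_.find(node);
    return it == nodes_.end() ? nullptr : it->second;
  }

  // Drop the NodeData when the node is destroyed.
  void delete_node_data(const rmw_node_t * node)
  {
    std::lock_guard<std::recursive_mutex> lock(mutex_);
    nodes_.erase(node);
  }

private:
  std::recursive_mutex mutex_;  // see the recursive_mutex discussion below
  std::unordered_map<const rmw_node_t *, std::shared_ptr<NodeData>> nodes_;
};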

@MichaelOrlov commented:
@Yadunund Friendly ping to resolve merge conflicts.

Resolved review threads on: rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp, rmw_zenoh_cpp/src/detail/rmw_node_data_t.cpp, rmw_zenoh_cpp/src/rmw_init.cpp

Review thread on the code that declares the liveliness token:
}

// Create the liveliness token.
zc_owned_liveliness_token_t token = zc_liveliness_declare_token(
@Yadunund (Member Author):
I switched to an std::recursive_mutex in 5295774. This zc_liveliness_declare_token call publishes a liveliness token, which causes the graph_subscriber in the same session to receive that token and trigger the graph callback. The graph callback locks the same mutex within rmw_context_impl_s, so without a recursive_mutex we get a deadlock.
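
To illustrate the behaviour in isolation, here is a minimal standalone sketch (not rmw_zenoh code; graph_callback and create_node_data below are stand-ins with made-up bodies): with std::recursive_mutex the nested acquisition on the same thread succeeds, whereas with std::mutex the second lock() on the same thread is undefined behaviour and in practice hangs.

#include <iostream>
#include <mutex>

std::recursive_mutex graph_mutex;  // swap for std::mutex and the program deadlocks

// Stand-in for the graph callback that the token declaration ends up
// invoking synchronously on the same thread.
void graph_callback()
{
  std::lock_guard<std::recursive_mutex> lock(graph_mutex);  // second acquisition, same thread
  std::cout << "graph callback ran\n";
}

// Stand-in for the node-creation path: takes the lock, then triggers the callback.
void create_node_data()
{
  std::lock_guard<std::recursive_mutex> lock(graph_mutex);  // first acquisition
  graph_callback();  // re-enters the lock on the same thread
}

int main()
{
  create_node_data();  // completes only because the mutex is recursive
  return 0;
}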

Collaborator:
This is a bit curious. What recursive_mutex does is to allow the same thread to take the lock multiple times without deadlocking. But my understanding (which may be wrong) was that the graph callback ran on a different thread of execution, which recursive_mutex will not guard against. Can you maybe explain a bit more so I can figure this out?

@Yadunund (Member Author) replied on Sep 6, 2024, quoting the comment above:
But my understanding (which may be wrong) was that the graph callback ran on a different thread of execution, which recursive_mutex will not guard against.

It seems that this is not the case. See my stack trace:

With the Zenoh router running, run rcl/test_client__rmw_zenoh_cpp or the usual ros2 run demo_nodes_cpp talker.

/home/yadunund/ros2_rolling/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py /home/yadunund/ros2_rolling/build/rcl/test_results/rcl/test_client__rmw_zenoh_cpp.gtest.xml --package-name rcl --output-file /home/yadunund/ros2_rolling/build/rcl/ament_cmake_gtest/test_client__rmw_zenoh_cpp.txt --env RMW_IMPLEMENTATION=rmw_zenoh_cpp --command /home/yadunund/ros2_rolling/build/rcl/test/test_client --gtest_output=xml:/home/yadunund/ros2_rolling/build/rcl/test_results/rcl/test_client__rmw_zenoh_cpp.gtest.xml

Then get the pid of this process

yadunund@ubuntu-noble-20240225:~/ros2_rolling$ ps -a | grep test_client
1477812 pts/7    00:00:00 test_client

The pid is 1477812.

Then use gdb in a second terminal to attach to the pid above.

sudo gdb -q test_client 1477812

Run info threads

(gdb) info threads
  Id   Target Id                                             Frame 
* 1    Thread 0x7ffff76328c0 (LWP 1477812) "test_client"     futex_wait (private=0, expected=2, futex_word=0x555555619040) at ../sysdeps/nptl/futex-internal.h:146
  2    Thread 0x7fffefa006c0 (LWP 1477822) "rx-1"            syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  3    Thread 0x7fffefe006c0 (LWP 1477821) "rx-0"            0x00007ffff792a042 in epoll_wait (epfd=19, events=0x7fffe8131bf0, maxevents=1024, timeout=7979)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  4    Thread 0x7ffff44006c0 (LWP 1477820) "tx-0"            0x00007ffff792a042 in epoll_wait (epfd=16, events=0x7fffe8009ef0, maxevents=1024, timeout=2475)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  5    Thread 0x7ffff4c006c0 (LWP 1477818) "acc-0"           0x00007ffff792a042 in epoll_wait (epfd=12, events=0x555555699850, maxevents=1024, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  6    Thread 0x7ffff54006c0 (LWP 1477816) "net-0"           0x00007ffff792a042 in epoll_wait (epfd=8, events=0x555555694fd0, maxevents=1024, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  7    Thread 0x7ffff58006c0 (LWP 1477815) "app-0"           0x00007ffff792a042 in epoll_wait (epfd=3, events=0x555555690750, maxevents=1024, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  8    Thread 0x7ffff6c006c0 (LWP 1477814) "test_client-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  9    Thread 0x7ffff76006c0 (LWP 1477813) "test_client-ust" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38

Inspect thread 1 and get the backtrace with bt

(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff76328c0 (LWP 1477812))]
#0  futex_wait (private=0, expected=2, futex_word=0x555555619040) at ../sysdeps/nptl/futex-internal.h:146
146	in ../sysdeps/nptl/futex-internal.h
(gdb) bt 
#0  futex_wait (private=0, expected=2, futex_word=0x555555619040) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x555555619040, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007ffff78a00f1 in lll_mutex_lock_optimized (mutex=0x555555619040) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x555555619040) at ./nptl/pthread_mutex_lock.c:93
#4  0x00007ffff6d35184 in __gthread_mutex_lock (__mutex=0x555555619040) at /usr/include/x86_64-linux-gnu/c++/13/bits/gthr-default.h:749
#5  0x00007ffff6d351d8 in std::mutex::lock (this=0x555555619040) at /usr/include/c++/13/bits/std_mutex.h:113
#6  0x00007ffff6d35392 in std::lock_guard<std::mutex>::lock_guard (this=0x7ffffffe3480, __m=...) at /usr/include/c++/13/bits/std_mutex.h:249
#7  0x00007ffff6d7e87a in rmw_context_impl_s::graph_sub_data_handler (sample=0x7ffffffe3570, data=0x555555619030)
    at /home/yadunund/ros2_rolling/src/ros2/rmw_zenoh/rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp:57
#8  0x00007ffff5ba5fc5 in zenohc::liveliness::zc_liveliness_declare_subscriber::{{closure}} ()
   from /home/yadunund/ros2_rolling/install/zenoh_c_vendor/opt/zenoh_c_vendor/lib/libzenohc.so
#9  0x00007ffff5df2b6c in zenoh::session::Session::handle_data () from /home/yadunund/ros2_rolling/install/zenoh_c_vendor/opt/zenoh_c_vendor/lib/libzenohc.so
#10 0x00007ffff5e061c5 in <zenoh::session::Session as zenoh::net::primitives::Primitives>::send_declare ()
   from /home/yadunund/ros2_rolling/install/zenoh_c_vendor/opt/zenoh_c_vendor/lib/libzenohc.so
#11 0x00007ffff5b6f179 in <zenoh::session::Session as zenoh::net::primitives::EPrimitives>::send_declare ()
   from /home/yadunund/ros2_rolling/install/zenoh_c_vendor/opt/zenoh_c_vendor/lib/libzenohc.so
#12 0x00007ffff5dd5447 in <zenoh::net::routing::dispatcher::face::Face as zenoh::net::primitives::Primitives>::send_declare ()
   from /home/yadunund/ros2_rolling/install/zenoh_c_vendor/opt/zenoh_c_vendor/lib/libzenohc.so
#13 0x00007ffff5ba5093 in zc_liveliness_declare_token () from /home/yadunund/ros2_rolling/install/zenoh_c_vendor/opt/zenoh_c_vendor/lib/libzenohc.so
#14 0x00007ffff6d94ee3 in rmw_zenoh_cpp::NodeData::make (id=0, session=..., domain_id=0, namespace_="/", node_name="test_client_node", enclave="/")
    at /home/yadunund/ros2_rolling/src/ros2/rmw_zenoh/rmw_zenoh_cpp/src/detail/rmw_node_data.cpp:57
#15 0x00007ffff6d80b34 in rmw_context_impl_s::create_node_data (this=0x555555610730, node=0x5555556369c0, ns="/", node_name="test_client_node")
    at /home/yadunund/ros2_rolling/src/ros2/rmw_zenoh/rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp:474
#16 0x00007ffff6d9fcae in rmw_create_node (context=0x5555556101b0, name=0x5555555b1b0b "test_client_node", namespace_=0x55555561e9d0 "/")
    at /home/yadunund/ros2_rolling/src/ros2/rmw_zenoh/rmw_zenoh_cpp/src/rmw_zenoh.cpp:236
#17 0x00007ffff7f8c3e9 in rcl_node_init () from /home/yadunund/ros2_rolling/install/rcl/lib/librcl.so
#18 0x000055555556ee9f in TestClientFixture::SetUp() ()
#19 0x00005555555ac6d1 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#20 0x0000555555592b51 in testing::Test::Run() ()
#21 0x0000555555592d85 in testing::TestInfo::Run() ()
#22 0x0000555555592fc7 in testing::TestSuite::Run() ()
#23 0x00005555555a2acc in testing::internal::UnitTestImpl::RunAllTests() ()
#24 0x00005555555931af in testing::UnitTest::Run() ()
#25 0x0000555555565fb4 in main ()

Looking at the stack trace above, you can see that the same thread (Thread 1) calls rmw_context_impl_s::create_node_data(), which invokes zc_liveliness_declare_token(); that call synchronously invokes rmw_context_impl_s::graph_sub_data_handler on the same thread, which then blocks because the same mutex is already locked.

Hope this clarifies!

Collaborator:
I see. Yeah, it is the case that the same thread is re-entering the lock, which is why we need a recursive_mutex here.

@Yadunund (Member Author) commented on Sep 5, 2024:

This is ready for another round of review.

Review thread on the NodeData destructor, which undeclares the liveliness token:
///=============================================================================
NodeData::~NodeData()
{
zc_liveliness_undeclare_token(z_move(token_));
Collaborator:
I'm not a huge fan of having different entities declare/undeclare the token. In this case, I'd prefer if we moved the declaration of the token into the constructor; that way there is no ambiguity of which entity owns it.

@Yadunund (Member Author):
Good callout. It was this way before I moved the constructor to public scope. I've now implemented a make() function that returns nullptr if construction of the entity or the liveliness token fails: bda3ed6
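
A minimal sketch of that pattern, using a simplified stand-in Token type in place of zc_owned_liveliness_token_t; the names, signatures, and key expression below are illustrative assumptions, not the PR's actual code:

#include <memory>
#include <optional>
#include <string>
#include <utility>

struct Token { bool valid; };  // stand-in for zc_owned_liveliness_token_t

// Stand-in for zc_liveliness_declare_token(); returns nullopt on failure.
std::optional<Token> declare_token(const std::string & keyexpr)
{
  if (keyexpr.empty()) {
    return std::nullopt;
  }
  return Token{true};
}

class NodeData
{
public:
  // Factory: all fallible setup happens here, so callers only ever see a
  // fully constructed object, or nullptr.
  static std::shared_ptr<NodeData> make(const std::string & keyexpr)
  {
    auto token = declare_token(keyexpr);
    if (!token.has_value()) {
      return nullptr;  // token creation failed; nothing to clean up
    }
    // std::make_shared cannot reach the private constructor, hence plain new.
    return std::shared_ptr<NodeData>(new NodeData(std::move(*token)));
  }

  ~NodeData()
  {
    // The object that declared the token is also the one that releases it
    // (zc_liveliness_undeclare_token in the real code).
    token_.valid = false;
  }

private:
  explicit NodeData(Token token)
  : token_(std::move(token)) {}

  Token token_;
};

int main()
{
  auto node_data = NodeData::make("some/liveliness/keyexpr");  // hypothetical key expression
  return node_data ? 0 : 1;
}

Keeping the declaration inside make() and the undeclaration in the destructor keeps token ownership entirely within NodeData, which addresses the ownership concern raised above.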

Base automatically changed from yadu/raii-context to rolling September 27, 2024 17:50
@Yadunund changed the base branch from rolling to yadu/raii-context September 27, 2024 18:13
Additional resolved review threads on: rmw_zenoh_cpp/src/detail/rmw_node_data.cpp, rmw_zenoh_cpp/src/detail/rmw_node_data.hpp
@clalancette (Collaborator) left a review:

This generally looks good to me. It will need to be retargeted to rolling, rebased, and @ahcorde's comments addressed, but otherwise I think this is good to go.

@Yadunund changed the base branch from yadu/raii-context to rolling September 30, 2024 16:46
@Yadunund merged commit 824886a into rolling Sep 30, 2024
8 checks passed
@Yadunund deleted the yadu/raii-node branch September 30, 2024 17:03