From f9d3d897ea31593dd3fc0d643ed9ed7b2eae94ee Mon Sep 17 00:00:00 2001 From: JamesWrigley Date: Sun, 17 Nov 2024 22:11:39 +0100 Subject: [PATCH] Fix a race condition in the tests Quick explanation of what the `GC tests for RemoteChannels` test does: 1. Create `RemoteChannel`s `rr` and `fstore` on worker 1 and worker 2 respectively from the master process. At this point only the master process knows about `rr` and `fstore`. 2. Master process calls `put!(fstore, rr)`, i.e. we remotecall worker 2 and put `rr` (which is owned worker 1 but is currently only known about by the master) into `fstore`. 3. Remotecall into worker 1 and check that it knows about `rr`. Step 3 should succeed despite us never previously explicitly communicating with worker 1, because `serialize(::ClusterSerializer, ::RemoteChannel)` will send a message to the owner of the `RemoteChannel` to inform them of its existence (see `send_add_client()`). This happens asynchronously in step 2, and on rare occasions worker 1 would not process that message before step 3, causing the test to fail. Now we give the check 10s to succeed. --- test/distributed_exec.jl | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/test/distributed_exec.jl b/test/distributed_exec.jl index 71f6ac6..d80d642 100644 --- a/test/distributed_exec.jl +++ b/test/distributed_exec.jl @@ -295,7 +295,9 @@ let wid1 = workers()[1], put!(fstore, rr) if include_thread_unsafe_tests() - @test remotecall_fetch(k -> haskey(Distributed.PGRP.refs, k), wid1, rrid) == true + # timedwait() is necessary because wid1 is asynchronously informed of + # the existence of rr/rrid through the call to `put!(fstore, rr)`. + @test timedwait(() -> remotecall_fetch(k -> haskey(Distributed.PGRP.refs, k), wid1, rrid), 10) === :ok end finalize(rr) # finalize locally yield() # flush gc msgs