fix: separate Heartbeat and ShardHandler to fibers #3936

Open
wants to merge 6 commits into base: main

Conversation

@kostasrim (Contributor) commented Oct 16, 2024

The issue surfaced when the test started failing in #3891. Heartbeat will block indefinitely if TriggerJournalWriteToSink can't push the serialized data to the channel (for example, when the replica becomes unresponsive). The consequence is that the shard_handler, and therefore BreakStalledFlows, is never called and Dragonfly stalls. The solution is to split the blocking/preemptive flow (Heartbeat) and BreakStalledFlows into separate fibers.

  • separate shard_handler from Heartbeat
  • add test

Resolves: #3937
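For illustration only, here is a minimal standalone sketch of the idea, using plain std::thread and std::atomic in place of the project's fibers (the diff itself uses MakeFiber); the point is that the heartbeat loop and the shard-handler loop no longer share a single loop, so a blocked Heartbeat cannot starve BreakStalledFlows. All names in the sketch are illustrative, not the PR's actual code.

#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Conceptual stand-ins for the real EngineShard members; names are illustrative only.
std::atomic<bool> stop_periodic{false};

void StartPeriodicLoops(std::chrono::milliseconds period,
                        std::function<void()> heartbeat,        // may block if the replica stalls
                        std::function<void()> shard_handler) {  // e.g. BreakStalledFlows
  // One loop per concern: if heartbeat() blocks, the shard_handler loop keeps running.
  std::thread heartbeat_loop([=] {
    while (!stop_periodic) {
      heartbeat();
      std::this_thread::sleep_for(period);
    }
  });
  std::thread handler_loop([=] {
    while (!stop_periodic) {
      shard_handler();
      std::this_thread::sleep_for(period);
    }
  });
  // The real code keeps fiber handles (e.g. fiber_shard_handler_periodic_) and joins them on shutdown.
  heartbeat_loop.detach();
  handler_loop.detach();
}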


void Heartbeat();
void RetireExpiredAndEvict();

void RunPeriodic(std::chrono::milliseconds period_ms, std::function<void()> shard_handler);
void RunHeartbeatPeriodic(std::chrono::milliseconds period_ms,
Collaborator:

Why does Heartbeat still get the shard handler?

Contributor Author:

I wanted to keep the callback in case we want to use it in the future, but I agree it's redundant. I will remove it.

});
} else {
fiber_shard_handler_periodic_ =
MakeFiber([this, index = pb->GetPoolIndex(), period_ms = clock_cycle_ms,
Collaborator:

See the condition in the original RunPeriodic code that decides when to run the shard handler:
if (shard_handler && last_handler_ms + 100 < last_heartbeat_ms)

We run shard_handler every 100 ms; this is unrelated to FLAGS_hz. I believe this function StartPeriodicFiberImpl is redundant.

Contributor Author:

It depends on the condition implicitly when the period is greater than 100 ms, because we always wait period_ms between runs. So if, say, period_ms=500, the shard handler will run every 500 ms, which means the 100 ms is really just a minimum. I refactored this a little to mirror that behaviour.
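A small sketch of the behaviour described above (editor's illustration, not the PR code): with a dedicated loop that always waits period_ms between runs, the 100 ms from the old RunPeriodic condition acts as a lower bound on the effective shard-handler period.

#include <algorithm>
#include <chrono>

// Effective interval between shard_handler runs, per the discussion above: the old
// condition (last_handler_ms + 100 < last_heartbeat_ms) enforced a 100 ms floor,
// while the dedicated loop always waits period_ms between runs.
std::chrono::milliseconds EffectiveHandlerPeriod(std::chrono::milliseconds period_ms) {
  return std::max(period_ms, std::chrono::milliseconds(100));
}

// EffectiveHandlerPeriod(std::chrono::milliseconds(500)) -> 500 ms (period dominates)
// EffectiveHandlerPeriod(std::chrono::milliseconds(10))  -> 100 ms (floor dominates)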


await asyncio.sleep(1)
await asyncio.sleep(10)
Collaborator:

is this change needed?

Contributor Author:

nope, it was accidental


await c_master.execute_command("debug", "populate", "100000", "foo", "5000")

class ExpirySeeder:
Collaborator:

put in utility.py


seeder = ExpirySeeder()
seeder_task = asyncio.create_task(seeder.run(c_master))
await seeder.wait_until(50000)
Collaborator:

If the expiry seeder stops setting keys here, wouldn't the bug reproduce as well? Do we need to keep sending commands to hit the bug?


async def run(self, client):
while not self.stop_flag:
await client.execute_command(f"SET tmp{self.i} bar{self.i} EX 4")
Collaborator:

Use a pipeline to speed things up.

await client.execute_command(f"SET tmp{self.i} bar{self.i} EX 4")
self.i = self.i + 1

async def wait_until(self, count):
Collaborator:

wait_until_n_inserts or something like this

await c_replica.execute_command("debug replica pause")

# Dragonfly will get stuck here. The journal writes to a channel which will block on write.
# Hearbeat() will be called and will get stuck while it tries to evict an expired item (because
Collaborator:

a. Please write "expires" rather than "evicts" an expired item.
b. I would not write that we block while pushing to the channel, as this is a very specific implementation detail that is also going to change. The reason we get stuck is that we block on the write to the socket, since the replica does not read from it.
c. Please don't write that BreakStalledFlows() will never be called, as that is your bug fix; I would describe the bug and these details in the GitHub bug issue.

# Timeout set to 3 seconds because we must first saturate the socket such that subsequent
# writes block. Otherwise, we will break the flows before Heartbeat actually deadlocks.
master = df_factory.create(
proactor_threads=2, replication_timeout=3000, vmodule="replica=2,dflycmd=2"
Collaborator:

Why specify proactor_threads=2?

Contributor Author:

Because it was easier to reproduce with 2 threads. I haven't tried with more, so the test may require tuning.

@@ -2295,7 +2295,9 @@ async def test_announce_ip_port(df_factory):
@pytest.mark.asyncio
async def test_replication_timeout_on_full_sync(df_factory: DflyInstanceFactory, df_seeder_factory):
# setting replication_timeout to a very small value to force the replica to timeout
master = df_factory.create(replication_timeout=100, vmodule="replica=2,dflycmd=2")
master = df_factory.create(
proactor_threads=2, replication_timeout=100, vmodule="replica=2,dflycmd=2"
Contributor Author:

Same here, no need for proactor_threads=2.

