Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CA-404658: Split heartbeat thread #17

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

BengangY
Copy link
Contributor

  1. Split heartbeat thread
  2. Print thread ID at startup

@BengangY BengangY marked this pull request as ready for review February 14, 2025 10:02
@minglumlu
Copy link

Can you please share the test results w/ and w/o the splitting change?

@BengangY BengangY force-pushed the private/bengangy/CA-404658 branch from addc01e to be67a8c Compare February 18, 2025 13:57
}

// start heartbeat sending thread
ret = pthread_create(&hb_send_thread, xhad_pthread_attr, hb_send, NULL);
if (ret)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do anything here to cleanup the receiving thread if we fail to start the sending thread?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need. Currently, if heartbeat threads fails to create, the return status of hb_initialize is non-0, then hb_cleanup_objects is called, but it doesn't do anything (code in it will not be compiled).

@BengangY BengangY force-pushed the private/bengangy/CA-404658 branch from be67a8c to 4f0163e Compare February 20, 2025 02:07
@BengangY
Copy link
Contributor Author

Can you please share the test results w/ and w/o the splitting change?

I created a crontab to run "cat /proc/net/udp | grep '02B6'" on each host every 5 minute to record UDP 694 packet drop. The test results are below:

  1. With splitting heartbeat thread:
    HA was running for about 5 days (from Feb 14 7:50 to Feb 19 5:35), there is no any packet drop.
  2. Without splitting heartbeat thread:
  3. HA was running for less than 1 day (from Feb 19 5:40 to Feb 20 2:10), I have observed packet drops learly on each host. The packet drops began at Feb 20 0:25.
    20250215-udp-drop-no-split-heartbeat-thread.txt

@edwintorok
Copy link
Contributor

I've recently added a CI to this repo, if you rebase your branch on top of latest master then we should start to see workflow runs here.

Copy link
Contributor

@alexbrett alexbrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run some testing with this change, and in a situation where xha was failing to keep up with receiving heartbeats this does seem to resolve it.

There is a separate effort underway to identify why the receiving is getting 'bogged down', as it isn't a huge load so it should cope even with sending heartbeats occasionally as well, but this change definitely appears to be an improvement so I think is worth taking regardless of the outcome of that investigation.

@BengangY BengangY force-pushed the private/bengangy/CA-404658 branch from 4f0163e to a4bf3d5 Compare February 26, 2025 15:00

MTC_STATIC void *
hb_send(
void *ignore)

Check warning

Code scanning / CodeChecker

unused parameter 'ignore' Warning

unused parameter 'ignore'
Split heartbeat thread into sending
heartbeat thread and receiving heartbeat thread.

Signed-off-by: Bengang Yuan <[email protected]>
@BengangY BengangY force-pushed the private/bengangy/CA-404658 branch 2 times, most recently from e165992 to e57c590 Compare February 27, 2025 10:09
Print all threads' ID in the xha.log at startup.

Signed-off-by: Bengang Yuan <[email protected]>
@BengangY BengangY force-pushed the private/bengangy/CA-404658 branch from e57c590 to 1770cff Compare February 27, 2025 10:27
@BengangY
Copy link
Contributor Author

I've recently added a CI to this repo, if you rebase your branch on top of latest master then we should start to see workflow runs here.

I have rebased on master and resolved the CI. Now it has passed all the checks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants