feat: add more sanity checks for T2 #15253
Description of PR
Add a BFD up-count check and a MAC entry count check to the sanity checks for the T2 topology.
Summary:
Fixes # (issue) Microsoft ADO 29825439 & 29825466
Type of change
Back port request
Approach
What is the motivation for this PR?
During our T2 nightly runs, we found that the port channel between two ASICs can be up while the MAC address has not been learned and the BFD session between them is down. We therefore need a sanity check that confirms all BFD sessions are up and all MAC addresses are learned. Otherwise, issues like this will affect test results and could impact the production environment.
How did you do it?
Added check_bfd_up_count() to the sanity checks, for the T2 topology only. This check takes ~4 seconds to run on a T2 device with 3 LCs (frontend nodes).
Added check_mac_entry_count() to the sanity checks, for the T2 supervisor only. This check takes ~17 seconds to finish on a T2 device whose supervisor has 10 ASICs.
How did you verify/test it?
Ran the updated code on a T2 device and confirmed that it checks the BFD up count and the MAC entry count correctly. I also confirmed that both checks are skipped on non-T2 devices.
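For reviewers, the pass/fail logic of the two checks can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the code in this PR: it does not use sonic-mgmt fixtures or DUT APIs. The CheckResult type, the per-ASIC input dictionaries (BFD session states and learned MAC counts), and the expected counts are all assumptions. The PR does not say how the real checks collect data or derive the expected values. Only the scoping (BFD check on T2 only, MAC check on the T2 supervisor only, skipped elsewhere) comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CheckResult:
    """Hypothetical result shape; not the real sonic-mgmt sanity-check format."""
    check_item: str
    failed: bool = False
    skipped: bool = False
    details: List[str] = field(default_factory=list)


def check_bfd_up_count(topo_type: str,
                       bfd_states_per_asic: Dict[str, List[str]],
                       expected_up_per_asic: Dict[str, int]) -> CheckResult:
    """Fail if any ASIC has fewer BFD sessions in 'Up' state than expected (T2 only)."""
    result = CheckResult(check_item="bfd_up_count")
    if topo_type != "t2":
        # Per the PR description, this check is skipped on non-T2 topologies.
        result.skipped = True
        return result
    for asic, expected in expected_up_per_asic.items():
        # Assumption: an ASIC missing from the collected data has zero sessions up.
        states = bfd_states_per_asic.get(asic, [])
        up = sum(1 for state in states if state.lower() == "up")
        if up < expected:
            result.failed = True
            result.details.append(f"{asic}: {up}/{expected} BFD sessions up")
    return result


def check_mac_entry_count(topo_type: str,
                          is_supervisor: bool,
                          mac_count_per_asic: Dict[str, int],
                          expected_per_asic: Dict[str, int]) -> CheckResult:
    """Fail if any supervisor ASIC has learned fewer MAC entries than expected."""
    result = CheckResult(check_item="mac_entry_count")
    if topo_type != "t2" or not is_supervisor:
        # Per the PR description, this check runs on the T2 supervisor only.
        result.skipped = True
        return result
    for asic, expected in expected_per_asic.items():
        # Assumption: an ASIC with no reported count has learned zero entries.
        learned = mac_count_per_asic.get(asic, 0)
        if learned < expected:
            result.failed = True
            result.details.append(f"{asic}: {learned}/{expected} MAC entries learned")
    return result


if __name__ == "__main__":
    # One of three BFD sessions on asic0 is down, so the check fails.
    print(check_bfd_up_count("t2", {"asic0": ["Up", "Up", "Down"]}, {"asic0": 3}))
    # A non-T2 topology skips the check entirely.
    print(check_bfd_up_count("t1", {}, {"asic0": 3}))
```

Treating "fewer than expected" as a failure, rather than requiring an exact match, is a design choice of this sketch. It flags the reported failure mode (sessions down, MACs not learned) without failing on extra entries.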
Any platform specific information?
Supported testbed topology if it's a new test case?
T2
Documentation