-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added SmartSwitch support in chassisd and enabling chassisd #467
base: master
Are you sure you want to change the base?
Conversation
d5ed9fc
to
0785bb1
Compare
platform identifier.
temp fix to get the build passing.
What are the states supported by the DPUs in the Smart Switch? |
”dpu_midplane_link_state” |
accidental removal of a few lines from the original chassisd file
sonic-chassisd/scripts/chassisd
Outdated
@@ -184,7 +188,7 @@ class ModuleUpdater(logger.Logger): | |||
self.chassis = chassis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that instead of modifying the existing ModuleUpdater
class, we should implement SmartSwitchModuleUpdater(ModuleUpdater)
. SmartSwitchModuleUpdater
should be derived from ModuleUpdater
and overwrite the methods that should behave differently for the Smart Switch. This approach allows us to keep the original implementation untouched and guarantees full backward compatibility with the chassisd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will consider this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as commented
|
||
if op == 'SET': | ||
admin_state = MODULE_ADMIN_DOWN | ||
elif op == 'DEL': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy on DEL of the admin state why do module ADMIN UP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy on DEL of the admin state why do module ADMIN UP?
@prgeor The original behavior is: The default is no config - meaning admin up when the user wants to shutdown a module adds a config "admins_status: down". So, the SET operation has been used for "admins_status: down" and DEL has been used for setting the module up. If you want we can change this for Smartswitch. Please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy didn't we agree to keep the default behavior on missing of the config DB admin status to have dark mode enabled as default? @vvolam is that not captured in the HLD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy didn't we agree to keep the default behavior on missing of the config DB admin status to have dark mode enabled as default? @vvolam is that not captured in the HLD?
@prgeor Yes, that is taken care by "def set_initial_dpu_admin_state(self):"
sonic-chassisd/scripts/chassisd
Outdated
@@ -73,6 +73,7 @@ CHASSIS_MODULE_REBOOT_INFO_TABLE = 'CHASSIS_MODULE_REBOOT_INFO_TABLE' | |||
CHASSIS_MODULE_REBOOT_TIMESTAMP_FIELD = 'timestamp' | |||
CHASSIS_MODULE_REBOOT_REBOOT_FIELD = 'reboot' | |||
DEFAULT_LINECARD_REBOOT_TIMEOUT = 180 | |||
DEFAULT_DPU_REBOOT_TIMEOUT = 180 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 180 secs sufficient for entire DPU reboot time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, should we add this to platform.json as vendor configurable timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be sufficient for a reboot as I told you. This is good for shutdown.
Here is what may work:
shutdown: pre_shutdown_time + graceful shutdown time + 180s for proper HW shutdown (240s is safe)
reboot: shutdown (240s) + hw off-on requirement if any (say 30s) + 180s for power on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A vendor configurable option is a good thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can pick a number now and optimize it as we move.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy during the HDL review sessions we agreed that this value is vendor specific and should be taken from platform.json
. Please change the implementation based on our agreement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleksandrivantsiv Yes, agree, but the existing modular-chassis LC use "platform_env.conf". Shall we be consistent with that?
PLATFORM_ENV_CONF_FILE = "/usr/share/sonic/platform/platform_env.conf"
self.linecard_reboot_timeout = DEFAULT_LINECARD_REBOOT_TIMEOUT
if os.path.isfile(PLATFORM_ENV_CONF_FILE):
with open(PLATFORM_ENV_CONF_FILE, 'r') as file:
for line in file:
field = line.split('=')[0].strip()
if field == "linecard_reboot_timeout":
self.linecard_reboot_timeout = int(line.split('=')[1].strip())
:return: | ||
""" | ||
|
||
def module_config_update(self, key, admin_state): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy could you please add a quick summary of function here and also remaining functions for future clarity?
sonic-chassisd/scripts/chassisd
Outdated
self.chassis = chassis | ||
# The threading.Lock is required for concurrent setting of admin_state | ||
# on multiple DPUs for faster bring-up and shutdown. | ||
self.lock = threading.Lock() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy which data is be protected against multiple thread access?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor It is hard to predict and say what vendor API would do under the "submit_callback". When multiple threads are forked chassis/modules resources can be corrupted. One thread can increase the chassis.number_of_modules and the other can decrease it causing data corruption and unpredictable states and behavior. In fact it is good have another one in submit_callback as well. But I haven't tested putting a lock in submit_callback.
"""Send admin_state trigger once to modules those are powered up""" | ||
lock = threading.Lock() | ||
threads = [] | ||
for module_index in range(0, self.module_updater.num_modules): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy I am not sure why we cannot handle all DPUs in one single thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy I am not sure why we cannot handle all DPUs in one single thread.
@prgeor It is not that we can't. We can but the call goes sequentially to each DPU and power cycling is a time consuming process and each DPU requires roughly 3 mins. So to power cycle a chassis we need 24+ mins. When we fork all 8 finishes in 3 mins
sonic-chassisd/scripts/chassisd
Outdated
with self.lock: | ||
# Submit the callback function as a separate thread. It is the responsibility | ||
# of the module to handle multiple b2b requests to the same module | ||
t = threading.Thread(target=self.submit_callback, args=(module_index, admin_state)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lock is valid until the start of the thread. It is safe to have that as multiple threads may access common resources in chassis and modules causing data corruption. Once the thread is created it has its own stack. In fact it is good to have a lock inside the submit_callback. It is hard to find what common resources are accessed inside each vendor API. So, it is better to have a lock even in submit_callback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lock doesn’t protect anything. It is created and acquired in the main thread, but none of the child threads use it.
sonic-chassisd/scripts/chassisd
Outdated
def handle_dpu_logic(): | ||
try: | ||
with lock: | ||
module = self.module_updater.chassis.get_module(module_index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The module_index
and op
variables here are module specific, so these variables should be function arguments for handle_dpu_logic
. With the current implementation there is a good probability that the module_index
and op
are already updated by the time a thread has started for the previous module, Example scenario:
DPU1 : op = MODULE_ADMIN_UP and module_index = 1
thread.start() is called (but thread has still not started yet or lock is not yet obtained - > this is thread1)
We move to the next element in the for loop in for module_index in range(0, self.module_updater.num_modules):
DPU2 : op = MODULE_ADMIN_DOWN and module_index = 2
thread.start() is called -> this is thread2.
At this point if thread1 starts executing the DPU2 related admin down operation is handled while the DPU1 admin up operation is skipped (Since op
and module_index
are already updated)
if module_info_dict is not None: | ||
# Get admin state of DPU | ||
key = module_info_dict[CHASSIS_MODULE_INFO_NAME_FIELD] | ||
admin_state = self.module_updater.get_module_admin_status(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per this comment and this comment the set_initial_dpu_admin_state
is supposed to handle the dark mode requirement (If there is no CONFIG_DB CHASSIS_MODULE
entries for the DPU then the admin state for the DPU should be down) but in this function, the get_module_admin_status
would return 'up'
by default if there is no CONFIG_DB entry, meaning op = 'up'
which would power on the dpu, but by default the DPU should be powered off if there is no CONFIG_DB entry
chassisd db clean up on admin_down state, removed the thread locks
@@ -75,13 +75,25 @@ def set_midplane_reachable(self, up): | |||
def get_all_asics(self): | |||
return self.asic_list | |||
|
|||
def get_reboot_cause(self): | |||
reboot_cause = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does this schema come from? This not what is written in the HLD:
Retrieves the cause of the previous reboot of the DPU module
Returns:
A tuple (string, string) where the first element is a string containing the cause of the previous reboot. This string must be one of the predefined strings in this class. If the first string is "REBOOT_CAUSE_HARDWARE_OTHER", the second string can be used to pass a description of the reboot cause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sonic-chassisd/scripts/chassisd
Outdated
if current_status != ModuleBase.MODULE_STATUS_OFFLINE: | ||
# DPU module has been turned on, update the reboot-cause | ||
# If the reboot-cause file not already present save it | ||
# If the STATE_DB REBOOT_CAUSE table already doesn't have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why STATE_DB? Should it be CHASSIS_STATE_DB?
sonic-chassisd/scripts/chassisd
Outdated
# If the reboot-cause file not already present save it | ||
# If the STATE_DB REBOOT_CAUSE table already doesn't have | ||
# an entry for the recent reboot update the DB | ||
reboot_cause = try_get(self.chassis.get_module(module_index).get_reboot_cause) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the reboot cause populated in the CHASSIS_DB? I don't see the implementation for this. The reboot cause is read into a local variable and that is it. Nothing else is done with the variable
Please implement a unit test for reboot_cause implementation |
@rameshraghupathy, @prgeor please extend the HLD with the flows for reboot_case and reboot_case_history. From this implementation, it is not clear how it works for the platforms that support the |
seamless migration to lightup mode, dpu_state db update when midplane fails, persisting dpu reboot cause etc
91d0a47
to
43e6b61
Compare
converting it to string format
Added SmartSwitch support in chassisd and enabling chassisd for fixed SmartSwitches
Description
chassisd is enabled only for modular chassis. Smartswitch is a fixed chassis. However it has been treated like a modular chassis to manage the DPU cards just like the line-cards of a modular chassis. chassisd will be enabled only on the smartswitch NPU and hence the scope of these changes are limited only to NPU and not applicable to DPU.
Motivation and Context
chassisd is enabled only for modular chassis. Smartswitch is a fixed chassis. However it has been treated like a modular chassis to manage the DPU cards just like the line-cards of a modular chassis. Hence, chassid is needed for SmartSwitch and enabled here. Also, some of the table updates and clean up are not required for SmartSwitch platform and hence using the is_smartswitch API to selectively enable it. the text "fixes #xxxx", "closes #xxxx" or "resolves #xxxx" here
How Has This Been Tested?
Additional Information (Optional)