BLE can't be used after erasing board and flash BT SHELL app #86444

Open
zhaynxp opened this issue Feb 28, 2025 · 1 comment · May be fixed by #86454
Labels
bug The issue is a bug, or the PR is fixing a bug

Comments

@zhaynxp
Contributor

zhaynxp commented Feb 28, 2025

Describe the bug
With the latest Zephyr downstream release, we found a critical issue: BLE does not work after erasing the board and then flashing the BT shell project. It can be reproduced on any RW612 board as long as the board is erased before testing.

  • What target platform are you using?
    NXP RW610/RW612
  • What have you tried to diagnose or workaround this issue?
    After further debugging, we found that the issue is caused by the system work queue (sysworkq), which executes bt_ready(), getting blocked while ECDH operations run on the long work queue.

Details:
The sysworkq thread calls bt_init() -> bt_smp_init() -> bt_pub_key_gen(), which hands BT public key generation to the bt_long_wq thread. While bt_long_wq is generating random data in z_impl_sys_csrand_get(), sysworkq also runs bt_ready(), which calls settings_load() and, through it, requests random data via the same interface. Although sysworkq has a higher priority than bt_long_wq, bt_long_wq finishes its work in z_impl_sys_csrand_get() first because that code is protected by a mutex, and sysworkq stays blocked during this period. Once bt_long_wq is done and releases the mutex, we expected sysworkq to acquire the ctr_lock mutex next. Instead, sysworkq stays blocked forever and never re-enters z_impl_sys_csrand_get() to complete the bt_ready() call chain.
A few screenshots were attached for reference; they show that after the issue hits, the call stack unexpectedly switches to the main thread.
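The call chain above boils down to two threads drawing from one generator that a single mutex protects. The following is a minimal host-side sketch of that pattern, assuming POSIX threads as a stand-in for Zephyr threads and a toy xorshift generator in place of CTR-DRBG; every name in it is hypothetical, and it is not Zephyr's code. It shows the behavior the reporter expected: whichever thread arrives second waits on ctr_lock and proceeds once the first releases it, so both calls complete.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one generator state shared by two threads and guarded
 * by a single, properly (statically) initialized mutex. */
static pthread_mutex_t ctr_lock = PTHREAD_MUTEX_INITIALIZER;
static int ctr_initialised;
static uint32_t drbg_state;
static int calls_completed;

/* Stand-in for z_impl_sys_csrand_get(): lock, lazily init, generate, unlock.
 * xorshift32 is a placeholder only; it is NOT a cryptographic generator. */
static int csrand_get_model(void *dst, size_t outlen)
{
	uint8_t *out = dst;

	assert(dst != NULL || outlen == 0);
	pthread_mutex_lock(&ctr_lock);
	if (!ctr_initialised) {
		drbg_state = 0x12345678u; /* placeholder seed */
		ctr_initialised = 1;
	}
	for (size_t i = 0; i < outlen; i++) {
		drbg_state ^= drbg_state << 13;
		drbg_state ^= drbg_state >> 17;
		drbg_state ^= drbg_state << 5;
		out[i] = (uint8_t)drbg_state;
	}
	calls_completed++;
	pthread_mutex_unlock(&ctr_lock);
	return 0;
}

/* Plays the role of bt_long_wq generating the public key. */
static void *long_wq_model(void *arg)
{
	uint8_t key_material[32];

	(void)arg;
	return (void *)(intptr_t)csrand_get_model(key_material, sizeof(key_material));
}

/* Plays the role of sysworkq running bt_ready() -> settings_load(). */
static void *sysworkq_model(void *arg)
{
	uint8_t id_material[16];

	(void)arg;
	return (void *)(intptr_t)csrand_get_model(id_material, sizeof(id_material));
}

/* Runs both threads; returns the number of completed calls, or -1 on error. */
int run_contention_model(void)
{
	pthread_t long_wq, sysworkq;
	void *r1, *r2;

	if (pthread_create(&long_wq, NULL, long_wq_model, NULL) != 0) {
		return -1;
	}
	if (pthread_create(&sysworkq, NULL, sysworkq_model, NULL) != 0) {
		pthread_join(long_wq, NULL);
		return -1;
	}
	pthread_join(long_wq, &r1);
	pthread_join(sysworkq, &r2);
	return (r1 == NULL && r2 == NULL) ? calls_completed : -1;
}
```

Because the mutex here is initialized before first use, the blocked thread is always handed the lock when the holder unlocks; the report describes the case where that handoff never happened.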

  • Is this a regression? If yes, have you been able to "git bisect" it to a
    specific commit?
    Yes, it is a regression. Our previous v4.1 release (Jan 20) does not have this issue, because ECDH was not enabled there and the long work queue was not used from bt_init().

To Reproduce
Steps to reproduce the behavior:

  1. west build -p always -b rd_rw612_bga zephyr/tests/bluetooth/shell -d ble_build/bt_shell
  2. Erase the board's flash, then flash zephyr.elf to the RW612 board
  3. Power on the board
  4. Run bt init, then bt scan on
  5. See the error

Expected behavior
Both bt init and bt scan on should succeed.

Impact
BLE cannot be used on any board whose flash was erased before use.

Logs and console output
*** Booting Zephyr OS build nxp-v4.0.0-13647-g0c6e38f51f3c ***
Type "help" for supported commands.
Before any Bluetooth commands you must bt init to initialize the stack.
uart:~$ bt init
Bluetooth initialized
[00:00:02.652,169] fs_nvs: 8 Sectors of 4096 bytes
[00:00:02.652,185] fs_nvs: alloc wra: 0, fd0
[00:00:02.652,189] fs_nvs: data wra: 0, 13
[00:00:02.905,481] bt_hci_core: No ID address. App must call settings_load()
uart:~$
uart:~$
uart:~$ bt scan on
Bluetooth set active scan failed (err -11)
uart:~$

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: Zephyr SDK
  • Commit SHA or Version used: 9cc8301

Additional context
We have not yet located the root cause of why the sysworkq thread blocks during bt init. Referring to this commit:
c7f3ad6
We switched to using the system work queue, rather than the long work queue, for all ECDH operations by adding CONFIG_BT_LONG_WQ=n to prj.conf and rebuilding the BT shell app. With that change the issue is gone, and BLE works normally every time after the board is erased. This is only a workaround to unblock use of the BT shell.
Could you please help fix the root cause of this issue? Thanks!
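Why CONFIG_BT_LONG_WQ=n hides the problem can be illustrated with a simplified model. This is a hypothetical host-side sketch, not Zephyr's implementation, and it assumes (as we understand Zephyr's k_mutex_lock() to behave) that an unowned mutex, or a re-lock by its current owner, is taken immediately without touching the wait queue, and that only a thread that must block is queued. With the long work queue disabled, the ECDH work and settings_load() both run on sysworkq, so their random-number requests are serialized on one thread and no waiter is ever queued, even if the queue itself is broken.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a recursive mutex whose uncontended path never
 * touches its wait queue; only contention uses the queue. */
struct mutex_model {
	int owner;           /* 0 means unowned */
	unsigned lock_count;
	bool wait_q_ok;      /* false models a wait queue that was never initialized */
	int lost_waiters;    /* waiters queued where no unlock can find them */
};

/* Returns true if the caller got the lock at once, false if it would block. */
static bool mutex_model_lock(struct mutex_model *m, int thread_id)
{
	if (m->lock_count == 0 || m->owner == thread_id) {
		/* Fast path: take (or re-take) the lock; wait queue untouched. */
		m->owner = thread_id;
		m->lock_count++;
		return true;
	}
	/* Slow path: the caller is queued; with a broken queue it is lost. */
	if (!m->wait_q_ok) {
		m->lost_waiters++;
	}
	return false;
}

static void mutex_model_unlock(struct mutex_model *m)
{
	assert(m->lock_count > 0 && "unlock without a matching lock");
	if (--m->lock_count == 0) {
		m->owner = 0;
	}
}
```

In the single-thread (workaround) case every lock takes the fast path and lost_waiters stays 0; in the original two-thread case, the thread that arrives while the other holds the lock goes down the slow path.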

@zhaynxp zhaynxp added the bug The issue is a bug, or the PR is fixing a bug label Feb 28, 2025
@zhaynxp zhaynxp linked a pull request Feb 28, 2025 that will close this issue
@zhaynxp
Contributor Author

zhaynxp commented Feb 28, 2025

The issue was root-caused and fixed by PR #86454.

zhaynxp added a commit to zhaynxp/Github_zephyr that referenced this issue Feb 28, 2025
Correct the definition of mutex ctr_lock, as the wrong definition led to
the sysworkq task not acquiring this mutex during bt init, which caused
BLE not to work as described in issue zephyrproject-rtos#86444

Signed-off-by: Ying Zhang <[email protected]>
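The commit message says only that ctr_lock had a "wrong definition". Assuming that means the mutex was never initialized (for example, a plain static struct k_mutex rather than one created with K_MUTEX_DEFINE() or passed to k_mutex_init()), the host-side model below sketches how that could produce the reported symptom: a thread that blocks on the mutex is queued where the unlock path can never find it. The types and function names are hypothetical and only loosely follow the convention of Zephyr's sys_dlist_t, where an empty list points back at itself; this is not Zephyr's code. Note that a zero-filled list head fails that "empty" test and so looks non-empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical circular doubly linked list used as a wait queue. An EMPTY
 * queue is one whose next/prev pointers point back at the queue itself. */
struct dnode {
	struct dnode *next;
	struct dnode *prev;
};

typedef struct dnode wait_q_model_t; /* head has the same layout as a node */

/* What proper initialization does (compare K_MUTEX_DEFINE() / k_mutex_init()). */
static void wait_q_init(wait_q_model_t *q)
{
	q->next = q;
	q->prev = q;
}

static bool wait_q_is_empty(const wait_q_model_t *q)
{
	return q->next == q;
}

/* Returns the first waiter, i.e. the thread an unlock would wake, or NULL. */
static struct dnode *wait_q_peek_head(wait_q_model_t *q)
{
	return wait_q_is_empty(q) ? NULL : q->next;
}

/* Appends a waiter at the tail, as a thread blocking on the mutex would be. */
static void wait_q_append(wait_q_model_t *q, struct dnode *node)
{
	struct dnode *tail = q->prev;

	assert(node != NULL);
	node->next = q;
	node->prev = tail;
	if (tail != NULL) {
		/* Guard exists only so this host model does not dereference NULL
		 * when the queue was zero-filled instead of initialized. */
		tail->next = node;
	}
	q->prev = node;
}
```

Appending a waiter to an initialized queue makes it the head. Appending the same kind of waiter to a zero-filled queue leaves wait_q_peek_head() returning NULL, which mirrors a blocked thread that no unlock ever wakes.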
@tomi-font tomi-font linked a pull request Feb 28, 2025 that will close this issue