Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[L0] fix a deadlock on a recursive event rwlock #1468

Merged
merged 1 commit into from
Mar 25, 2024

Conversation

pbalcer
Copy link
Contributor

@pbalcer pbalcer commented Mar 22, 2024

L0, when creating a list of events to wait on, in some cases was first grabbing a lock on a potentially completed event, and then tried to get a command list, which sometimes needs to cleanup all completed events. This caused a deadlock.

This patch moves getting a command list to before the event lock. But because the lock is required to decide whether this command list actually needed, we might be wasting time here.

@pbalcer pbalcer requested a review from a team as a code owner March 22, 2024 14:41
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 12.43%. Comparing base (78ef1ca) to head (3a64d79).
Report is 199 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1468      +/-   ##
==========================================
- Coverage   14.82%   12.43%   -2.40%     
==========================================
  Files         250      241       -9     
  Lines       36220    36242      +22     
  Branches     4094     4111      +17     
==========================================
- Hits         5369     4506     -863     
- Misses      30800    31732     +932     
+ Partials       51        4      -47     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@nrspruit nrspruit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches what I was seeing in the logs during my debug.

Thanks for finding the right place to move the recursive call!

L0, when creating a list of events to wait on, in some cases
was first grabbing a lock on a potentially completed event,
and then tried to get a command list, which sometimes needs to
cleanup all completed events. This caused a deadlock.

This patch moves getting a command list to before the event lock.
But because the lock is required to decide whether this command
list actually needed, we might be wasting time here.
@pbalcer pbalcer force-pushed the l0-recursive-event-deadlock branch from 3a64d79 to 39c0023 Compare March 22, 2024 15:52
@pbalcer
Copy link
Contributor Author

pbalcer commented Mar 22, 2024

intel/llvm#13112

@pbalcer pbalcer added the v0.9.x Include in the v0.9.x release label Mar 25, 2024
@aarongreig aarongreig merged commit 8dba1fd into oneapi-src:main Mar 25, 2024
49 of 50 checks passed
kbenzie pushed a commit to kbenzie/unified-runtime that referenced this pull request Apr 8, 2024
…adlock

[L0] fix a deadlock on a recursive event rwlock
@kbenzie kbenzie mentioned this pull request Apr 16, 2024
19 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v0.9.x Include in the v0.9.x release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants