-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix assertion failure from erroneous unlock in Adaptive OBIM. #350
Fix assertion failure from erroneous unlock in Adaptive OBIM. #350
Conversation
The problem is actually not to do with the lock ownership. See the inline comments. |
@@ -296,8 +296,7 @@ struct AdaptiveOrderedByIntegerMetric : private boost::noncopyable { | |||
((double)(p.slowPopsLastPeriod) / (double)(p.sinceLastFix)) > | |||
1.0 / (double)(chunk_size)) { | |||
for (unsigned i = 1; i < runtime::activeThreads; ++i) { | |||
while (current.getRemote(i)->lock.try_lock()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here it is the culprit - something is missing!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great find. That totally explains why it was broken before. Thanks!
@@ -358,8 +357,7 @@ struct AdaptiveOrderedByIntegerMetric : private boost::noncopyable { | |||
delta = 0; | |||
|
|||
for (unsigned i = 1; i < runtime::activeThreads; ++i) { | |||
while (current.getRemote(i)->lock.try_lock()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto!
@@ -473,8 +471,7 @@ struct AdaptiveOrderedByIntegerMetric : private boost::noncopyable { | |||
|
|||
void push(const value_type& val) { | |||
ThreadData& p = *current.getLocal(); | |||
while (!p.lock.try_lock()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The explicit lock here is fine. I prefer to start a new issue/pr for replacing some explicit lock with std::lock_guard if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A real issue is that numPushesThisStep can be 0, not to be divided.
: private boost::noncopyable, | ||
public internal::AdaptiveOrderedByIntegerMetricComparator<Index, | ||
UseDescending> { | ||
template <typename _T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually moved the underscore to after the names deliberately since underscore prefixes followed by a capital letter are reserved symbols for the compiler. That's an issue that's present throughout the current codebase though, so we can just address it here later along with everything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I was just making it resemble OBIM's code without criticism on the old OBIM code. I should have discussed it with you. Thanks!
for (unsigned i = 1; i < runtime::activeThreads; ++i) { | ||
while (current.getRemote(i)->lock.try_lock()) | ||
while (!data.getRemote(i)->lock.try_lock()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just do an explicit call to lock instead of while loops? It seems simpler and may be slightly better for performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'm not quite aware of the difference between try_lock and lock - Idk why they were designed separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a performance thing. I'm not sure about the details involved though.
} | ||
} | ||
#ifdef UNMERGE_ENABLED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block was originally disabled. Did you mean to re-enable it? I assume they disabled it for performance reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this macro-control a template parameter, which is still false/disabled by default. To me, it doesn't make sense to keep a statically disabled code block in the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome. I missed that change. That's a great idea!
namespace internal { | ||
|
||
template <typename Index, bool UseDescending> | ||
struct AdaptiveOrderedByIntegerMetricComparator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When porting the code to the latest version I actually set things up to remove any reliance on the OBIM-style internal comparator classes since they really only exist to manage the direction the comparisons run. It's not a huge deal one way or the other, but I'd ultimately like to move OBIM and Adaptive OBIM away from using these helper classes since they're very verbose for what they actually are doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. Again, I was matching the code structure with OBIM assuming that OBIM's code was in a good shape. Now OBIM should be the one to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yah, these are really easy to change. I can just take care of it in a separate PR after we get this one sorted out.
@@ -296,8 +296,7 @@ struct AdaptiveOrderedByIntegerMetric : private boost::noncopyable { | |||
((double)(p.slowPopsLastPeriod) / (double)(p.sinceLastFix)) > | |||
1.0 / (double)(chunk_size)) { | |||
for (unsigned i = 1; i < runtime::activeThreads; ++i) { | |||
while (current.getRemote(i)->lock.try_lock()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great find. That totally explains why it was broken before. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've fixed as many issues with ill-logic as I could so far. Other changes are just to conform with the old OBIM code assuming that OBIM is well-implemented. I prefer to make the code uniform and even considered making AdaptiveOBIM a derived class of OBIM since there is a bunch of repeated code.
Although this doesn't affect our current project, I prefer to somehow fix it now rather than revisit this later - or never, we don't know when. It's handy and does not take too much time.
Okay, is this good to go now? It seems like the immediate outstanding issues for this PR have been taken care of. |
I think so given that the implementation details w.r.t. OBIM will be handled in another PR. Will merge then. Please also test it on A* or whatever since I only test it with sssp to make sure nothing's error prone. |
WRT subclassing, that's an interesting question. It seems like something that could maybe work, but I'm not sure whether or not we'd end up with a net simplification. For example, the UniformBSP parameter doesn't really make sense for the adaptive OBIM scheduler. If this is in a working state, I'd like to postpone further modifications/simplifications until after the upcoming paper deadline. We've got a lot left to do for that. |
Yah, I've just been testing it on SSSP and FMM too. The A* code needs to be forward ported like this was, so I'll tackle it after a few deadlines have passed. |
I was just testing out the new version and now it's deadlocking in SSSP. Any idea why? |
Works fine here. Could you please provide the command line? A known issue is that gcc complains somewhere using '*' as the multiplication operator in boolean expressions which by default is dereferencing. Will commit a patch soon. |
Just pushed a patch to the repo on the cdgc server but it seems not sync'ed here. |
Yah, it's not automatically synchronized. If you push something to one place you should also push to the other. |
#351 is the concrete test case I'm using. |
Pushed you patch to GitHub. |
Thanks! Let's move to the new PR then. |
Fixes #349, though I'm not entirely sure why yet.