Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix assertion failure from erroneous unlock in Adaptive OBIM. #350

Conversation

insertinterestingnamehere
Copy link
Member

Fixes #349, though I'm not entirely sure why yet.

@bozhiyou
Copy link
Member

The problem is actually not to do with the lock ownership. See the inline comments.

@@ -296,8 +296,7 @@ struct AdaptiveOrderedByIntegerMetric : private boost::noncopyable {
((double)(p.slowPopsLastPeriod) / (double)(p.sinceLastFix)) >
1.0 / (double)(chunk_size)) {
for (unsigned i = 1; i < runtime::activeThreads; ++i) {
while (current.getRemote(i)->lock.try_lock())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it is the culprit - something is missing!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find. That totally explains why it was broken before. Thanks!

@@ -358,8 +357,7 @@ struct AdaptiveOrderedByIntegerMetric : private boost::noncopyable {
delta = 0;

for (unsigned i = 1; i < runtime::activeThreads; ++i) {
while (current.getRemote(i)->lock.try_lock())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto!

@@ -473,8 +471,7 @@ struct AdaptiveOrderedByIntegerMetric : private boost::noncopyable {

void push(const value_type& val) {
ThreadData& p = *current.getLocal();
while (!p.lock.try_lock())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The explicit lock here is fine. I prefer to start a new issue/pr for replacing some explicit lock with std::lock_guard if necessary.

Copy link
Member

@bozhiyou bozhiyou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A real issue is that numPushesThisStep can be 0, not to be divided.

: private boost::noncopyable,
public internal::AdaptiveOrderedByIntegerMetricComparator<Index,
UseDescending> {
template <typename _T>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually moved the underscore to after the names deliberately since underscore prefixes followed by a capital letter are reserved symbols for the compiler. That's an issue that's present throughout the current codebase though, so we can just address it here later along with everything else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I was just making it resemble OBIM's code without criticism on the old OBIM code. I should have discussed it with you. Thanks!

for (unsigned i = 1; i < runtime::activeThreads; ++i) {
while (current.getRemote(i)->lock.try_lock())
while (!data.getRemote(i)->lock.try_lock())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just do an explicit call to lock instead of while loops? It seems simpler and may be slightly better for performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm not quite aware of the difference between try_lock and lock - Idk why they were designed separately.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a performance thing. I'm not sure about the details involved though.

}
}
#ifdef UNMERGE_ENABLED
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block was originally disabled. Did you mean to re-enable it? I assume they disabled it for performance reasons.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this macro-control a template parameter, which is still false/disabled by default. To me, it doesn't make sense to keep a statically disabled code block in the codebase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. I missed that change. That's a great idea!

namespace internal {

template <typename Index, bool UseDescending>
struct AdaptiveOrderedByIntegerMetricComparator {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When porting the code to the latest version I actually set things up to remove any reliance on the OBIM-style internal comparator classes since they really only exist to manage the direction the comparisons run. It's not a huge deal one way or the other, but I'd ultimately like to move OBIM and Adaptive OBIM away from using these helper classes since they're very verbose for what they actually are doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Again, I was matching the code structure with OBIM assuming that OBIM's code was in a good shape. Now OBIM should be the one to change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah, these are really easy to change. I can just take care of it in a separate PR after we get this one sorted out.

@@ -296,8 +296,7 @@ struct AdaptiveOrderedByIntegerMetric : private boost::noncopyable {
((double)(p.slowPopsLastPeriod) / (double)(p.sinceLastFix)) >
1.0 / (double)(chunk_size)) {
for (unsigned i = 1; i < runtime::activeThreads; ++i) {
while (current.getRemote(i)->lock.try_lock())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find. That totally explains why it was broken before. Thanks!

Copy link
Member

@bozhiyou bozhiyou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed as many issues with ill-logic as I could so far. Other changes are just to conform with the old OBIM code assuming that OBIM is well-implemented. I prefer to make the code uniform and even considered making AdaptiveOBIM a derived class of OBIM since there is a bunch of repeated code.

Although this doesn't affect our current project, I prefer to somehow fix it now rather than revisit this later - or never, we don't know when. It's handy and does not take too much time.

@insertinterestingnamehere
Copy link
Member Author

Okay, is this good to go now? It seems like the immediate outstanding issues for this PR have been taken care of.

@bozhiyou
Copy link
Member

Okay, is this good to go now? It seems like the immediate outstanding issues for this PR have been taken care of.

I think so given that the implementation details w.r.t. OBIM will be handled in another PR. Will merge then. Please also test it on A* or whatever since I only test it with sssp to make sure nothing's error prone.

@insertinterestingnamehere
Copy link
Member Author

WRT subclassing, that's an interesting question. It seems like something that could maybe work, but I'm not sure whether or not we'd end up with a net simplification. For example, the UniformBSP parameter doesn't really make sense for the adaptive OBIM scheduler.

If this is in a working state, I'd like to postpone further modifications/simplifications until after the upcoming paper deadline. We've got a lot left to do for that.

@insertinterestingnamehere
Copy link
Member Author

Yah, I've just been testing it on SSSP and FMM too. The A* code needs to be forward ported like this was, so I'll tackle it after a few deadlines have passed.

@insertinterestingnamehere insertinterestingnamehere merged commit 1eba7f1 into IntelligentSoftwareSystems:master Jul 27, 2020
@insertinterestingnamehere
Copy link
Member Author

I was just testing out the new version and now it's deadlocking in SSSP. Any idea why?

@bozhiyou
Copy link
Member

Works fine here. Could you please provide the command line?

A known issue is that gcc complains somewhere using '*' as the multiplication operator in boolean expressions which by default is dereferencing. Will commit a patch soon.

@bozhiyou
Copy link
Member

Just pushed a patch to the repo on the cdgc server but it seems not sync'ed here.

@insertinterestingnamehere
Copy link
Member Author

Yah, it's not automatically synchronized. If you push something to one place you should also push to the other.

@insertinterestingnamehere
Copy link
Member Author

#351 is the concrete test case I'm using.

@insertinterestingnamehere
Copy link
Member Author

Pushed you patch to GitHub.

@bozhiyou
Copy link
Member

Thanks! Let's move to the new PR then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Locking issue with AdaptiveObim (PMOD)
2 participants