The problem is totally reproducible with -np 4, even with all debug output turned on, so it is not some kind of race condition. I have done some massive debug log dumps and confirmed that the code thinks a total of 30 cutpoints were sent and 44 were received. That mismatch is weird, and will definitely keep the code from terminating. Will try to diagnose it, probably tomorrow.
I looked at the logs and the thread that shares cut points between processors is totally rotten. It's doing all sorts of unintended stuff and needs to be reworked.
This bug is a side effect of one of its unintended behaviors. It is generating messages between processors even during ramp-up, which it should not. For this particular problem on 4 processors, it turns out that the first problem solves in ramp-up. PEBBL doesn't expect stray messages to be floating around at this point, so it just terminates and control goes back to REPR. The excess messages sit around and are received at the end of the ramp-up of the next RMA problem, skewing the counts, and then things unravel.
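To make the failure mode above concrete, here is a toy Python model of it. It is only a sketch, not PEBBL code: the `Solver` class, the `run_two_problems` function, and the shared `network` queue are all hypothetical names invented for illustration. The 30 sent / 44 received totals come from the debug logs mentioned earlier; attributing the 14-message difference entirely to leftovers from the first problem is an assumption made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Solver:
    """Toy stand-in for one solve; tracks its own message counts."""
    sent: int = 0
    received: int = 0

    def send(self, network: deque, n: int) -> None:
        # Put n cut-point messages on the shared (hypothetical) network.
        for _ in range(n):
            network.append("cutpoint")
        self.sent += n

    def drain(self, network: deque) -> None:
        # Receive everything pending, including messages this solve never sent.
        while network:
            network.popleft()
            self.received += 1

    def can_terminate(self) -> bool:
        # Count-based termination check: all sent messages must be received.
        return self.sent == self.received


def run_two_problems(stray_from_first: int):
    network = deque()

    # First problem finishes during ramp-up but still emits messages,
    # then exits without draining them (the unintended behavior).
    first = Solver()
    first.send(network, stray_from_first)

    # Second problem inherits the leftovers on the same network.
    second = Solver()
    second.send(network, 30)
    second.drain(network)
    return second.sent, second.received, second.can_terminate()


if __name__ == "__main__":
    print(run_two_problems(14))  # counts disagree, so termination never fires
    print(run_two_problems(0))   # no strays, counts agree
```

In this model the second solve can only terminate when nothing was left over from the first, which matches the suggestion that either ramp-up should not send messages at all or pending messages should be drained before control returns to the caller.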
I could not reproduce this error using the parallel RMA run only.
JE: "One possible explanation is message sent and received counts not matching up."