
PEBBL's termination criteria are not met in parallel boosting #29

Open · aik7 opened this issue on Jan 25, 2021 · 4 comments · Labels: bug
aik7 (Owner) commented on Jan 25, 2021:

I could not reproduce this error using the parallel RMA run only.

JE: "One possible explanation is message sent and received counts not matching up."

$ mpirun -np 4 ./build/boosting --isUseGurobi=true --isSaveModel=true --isStandData=false --c=1.0 --e=1.0 --p=2 --tolStopCond=0.00001 --numIterations=200 data.txt
User-specified solver options:
isUseGurobi true
isSaveModel true
isStandData false
c 1.0
e 1.0
p 2
tolStopCond 0.00001
numIterations 200

Train (# of observations, # of attributes): 149 9
Gurobi solver
File exists
Master Solution: 16.7799         CPU Time: 0.00
[0] Using default values for all solver options
[0] Iter: 0     TrainMSE: 0.079643
GRMA Solution: 51.0000  CPU Time: 0.00
ERMA Solution: 51.0000  CPU time: 0.00  Num of Nodes: 3
Master Solution: 13.8058         CPU Time: 0.00
[0] -Iter: 1    TrainMSE: 0.079643
GRMA Solution: 22.5000  CPU Time: 0.00
[0] -h#21 pool=0  inc=22.5000000
aik7 added the bug label on Jan 25, 2021

aik7 (Owner, Author) commented on Jan 25, 2021:

  • It did not happen with -np 2.
  • With the --debug=10 option, you can see the following output:
[2] Result is {0:7} -1 (Dead)
[0] h#11 pool=0  inc=12.0000000

aik7 (Owner, Author) commented on Jan 25, 2021:

With the --fracCachedCutPts=1 option (no cutpoint caching), I did not get this error.

jeckstei (Collaborator) commented:
The problem is totally reproducible with -np 4, even with all debugs turned on, so it is not some kind of race condition. I have done some massive debug log dumps and confirmed that it thinks that a total of 30 cutpoints were sent and 44 were received. That's weird, and will definitely keep the code from terminating. Will try to diagnose it, probably tomorrow.
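For context, PEBBL-style parallel branch-and-bound codes typically decide that the system is quiescent by comparing global message counts. The following sketch is illustrative only (localSent, localReceived, and globallyQuiescent are hypothetical names, not PEBBL API); it shows why a 30-sent / 44-received mismatch keeps the termination test from ever passing.

// Illustrative sketch, not PEBBL's implementation: message-count
// termination detection. Quiescence is declared only when the global
// number of messages sent equals the global number received, so a
// persistent mismatch (e.g. 30 sent vs. 44 received) blocks termination.
#include <mpi.h>

extern long localSent;      // hypothetical per-rank counter of messages sent
extern long localReceived;  // hypothetical per-rank counter of messages received

bool globallyQuiescent(MPI_Comm comm)
{
  long local[2]  = { localSent, localReceived };
  long global[2] = { 0, 0 };
  MPI_Allreduce(local, global, 2, MPI_LONG, MPI_SUM, comm);
  return global[0] == global[1];   // terminate only when the counts match
}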

jeckstei (Collaborator) commented:
I looked at the logs, and the thread that shares cut points between processors is completely broken. It's doing all sorts of unintended things and needs to be reworked.

This bug is a side effect of one of its unintended behaviors. It is generating messages between processors even during ramp-up, which it should not. For this particular problem on 4 processors, it turns out that the first problem solves in ramp-up. PEBBL doesn't expect stray messages to be floating around at this point, so it just terminates and control goes back to REPR. The excess messages sit around and are received at the end of the ramp-up of the next RMA problem, skewing the counts, and then things unravel.
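One natural mitigation for stray ramp-up traffic is to drain any pending cut-point messages before control returns to REPR, so they cannot be counted against the next RMA subproblem. The sketch below only illustrates that idea under assumed names (CUTPOINT_TAG and drainStrayCutpointMessages are hypothetical); it is not the actual PEBBL or REPR code.

// Illustrative sketch, not the actual fix: drain stale cut-point messages
// left over from ramp-up so they cannot skew the sent/received counts of
// the next subproblem.
#include <mpi.h>
#include <vector>

const int CUTPOINT_TAG = 42;   // hypothetical tag used for cut-point sharing

void drainStrayCutpointMessages(MPI_Comm comm)
{
  int pending = 0;
  MPI_Status status;
  for (;;)
  {
    MPI_Iprobe(MPI_ANY_SOURCE, CUTPOINT_TAG, comm, &pending, &status);
    if (!pending) break;
    int count = 0;
    MPI_Get_count(&status, MPI_BYTE, &count);
    std::vector<char> buf(count);
    // Receive and discard the stale message.
    MPI_Recv(buf.data(), count, MPI_BYTE, status.MPI_SOURCE,
             CUTPOINT_TAG, comm, MPI_STATUS_IGNORE);
  }
}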
