
PEBBL's termination criteria are not met in parallel boosting #29

Open · aik7 opened this issue on Jan 25, 2021 · 4 comments · Labels: bug
aik7 (Owner) commented on Jan 25, 2021:

I could not reproduce this error using the parallel RMA run only.

JE: "One possible explanation is message sent and received counts not matching up."

$ mpirun -np 4 ./build/boosting --isUseGurobi=true --isSaveModel=true --isStandData=false --c=1.0 --e=1.0 --p=2 --tolStopCond=0.00001 --numIterations=200 data.txt
User-specified solver options:
isUseGurobi true
isSaveModel true
isStandData false
c 1.0
e 1.0
p 2
tolStopCond 0.00001
numIterations 200

Train (# of observations, # of attributes): 149 9
Gurobi solver
File exists
Master Solution: 16.7799         CPU Time: 0.00
[0] Using default values for all solver options
[0] Iter: 0     TrainMSE: 0.079643
GRMA Solution: 51.0000  CPU Time: 0.00
ERMA Solution: 51.0000  CPU time: 0.00  Num of Nodes: 3
Master Solution: 13.8058         CPU Time: 0.00
[0] -Iter: 1    TrainMSE: 0.079643
GRMA Solution: 22.5000  CPU Time: 0.00
[0] -h#21 pool=0  inc=22.5000000
aik7 added the bug label on Jan 25, 2021

aik7 (Owner, Author) commented on Jan 25, 2021:

  • It did not happen with -np 2.
  • With the --debug=10 option, you can see the following output:
[2] Result is {0:7} -1 (Dead)
[0] h#11 pool=0  inc=12.0000000

aik7 (Owner, Author) commented on Jan 25, 2021:

With the --fracCachedCutPts=1 option (no cutpoint caching), I did not get this error.

jeckstei (Collaborator) commented:
The problem is totally reproducible with -np 4, even with all debugs turned on, so it is not some kind of race condition. I have done some massive debug log dumps and confirmed that it thinks that a total of 30 cutpoints were sent and 44 were received. That's weird, and will definitely keep the code from terminating. Will try to diagnose it, probably tomorrow.
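For context, PEBBL-style parallel branch-and-bound codes typically decide that the system is quiescent by comparing global message counts. The following sketch is illustrative only (localSent, localReceived, and globallyQuiescent are hypothetical names, not PEBBL API); it shows why a 30-sent / 44-received mismatch keeps the termination test from ever passing.

// Illustrative sketch, not PEBBL's implementation: message-count
// termination detection. Quiescence is declared only when the global
// number of messages sent equals the global number received, so a
// persistent mismatch (e.g. 30 sent vs. 44 received) blocks termination.
#include <mpi.h>

extern long localSent;      // hypothetical per-rank counter of messages sent
extern long localReceived;  // hypothetical per-rank counter of messages received

bool globallyQuiescent(MPI_Comm comm)
{
  long local[2]  = { localSent, localReceived };
  long global[2] = { 0, 0 };
  MPI_Allreduce(local, global, 2, MPI_LONG, MPI_SUM, comm);
  return global[0] == global[1];   // terminate only when the counts match
}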

jeckstei (Collaborator) commented:
I looked at the logs, and the thread that shares cut points between processors is completely broken. It's doing all sorts of unintended things and needs to be reworked.

This bug is a side effect of one of its unintended behaviors. It is generating messages between processors even during ramp-up, which it should not. For this particular problem on 4 processors, it turns out that the first problem solves in ramp-up. PEBBL doesn't expect stray messages to be floating around at this point, so it just terminates and control goes back to REPR. The excess messages sit around and are received at the end of the ramp-up of the next RMA problem, skewing the counts, and then things unravel.
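One natural mitigation for stray ramp-up traffic is to drain any pending cut-point messages before control returns to REPR, so they cannot be counted against the next RMA subproblem. The sketch below only illustrates that idea under assumed names (CUTPOINT_TAG and drainStrayCutpointMessages are hypothetical); it is not the actual PEBBL or REPR code.

// Illustrative sketch, not the actual fix: drain stale cut-point messages
// left over from ramp-up so they cannot skew the sent/received counts of
// the next subproblem.
#include <mpi.h>
#include <vector>

const int CUTPOINT_TAG = 42;   // hypothetical tag used for cut-point sharing

void drainStrayCutpointMessages(MPI_Comm comm)
{
  int pending = 0;
  MPI_Status status;
  for (;;)
  {
    MPI_Iprobe(MPI_ANY_SOURCE, CUTPOINT_TAG, comm, &pending, &status);
    if (!pending) break;
    int count = 0;
    MPI_Get_count(&status, MPI_BYTE, &count);
    std::vector<char> buf(count);
    // Receive and discard the stale message.
    MPI_Recv(buf.data(), count, MPI_BYTE, status.MPI_SOURCE,
             CUTPOINT_TAG, comm, MPI_STATUS_IGNORE);
  }
}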
