run_eicrecon and eicrecon do not handle SIGINT well (leave zombie in case of run_eicrecon) #310
Comments
I was able to reproduce this. The problem only occurs if the signal is sent during start-up. If you wait until the geometry starts to load, then it will exit as expected.
A bit of context: The "Exiting Gracefully ... " message means JANA will allow any events currently being processed to finish before it stops. The main use case is a user who wants to end processing early but cleanly close the ROOT file, so that the processing done up to that point is saved. On the command line, one can issue a second and even a third Ctrl-C and it will attempt more aggressive termination. For the case the OP describes, though, this still does not help (see below).
A copy of the full stack trace taken when the program is in the stalled state is given below. This does point to a bug in JANA. Specifically: a mutex is locked while the
I'm looping Nathan in since he will likely want to make the fix. The problem actually gets a bit complicated when you start thinking about how to properly handle all cases (e.g. honor the mutex if in regular event processing, but not if in startup?). This really should be fixed in JANA itself as opposed to trying to address it in EICrecon.
(gdb) bt |
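To make the escalation idea concrete, here is a minimal Python sketch of a handler that counts Ctrl-C presses and maps them to increasingly aggressive stop levels. It is not JANA's implementation: the class name `StopRequest` and the level names are assumptions made for illustration. The handler only updates a counter and never acquires a lock, which sidesteps the kind of deadlock where the interrupted code already holds the mutex the handler wants.

```python
import signal


class StopRequest:
    """Counts SIGINTs. Hypothetical helper, not part of JANA or EICrecon."""

    # Level names are illustrative assumptions: the first Ctrl-C asks for a
    # graceful stop, later ones ask for progressively harsher termination.
    LEVELS = ("running", "graceful", "hard", "abort")

    def __init__(self):
        self.count = 0

    def handle(self, signum=None, frame=None):
        # Only touch plain state here. Acquiring a lock that the interrupted
        # code may already hold would block forever.
        self.count += 1

    @property
    def level(self):
        # Clamp so that any further Ctrl-C stays at the last level.
        return self.LEVELS[min(self.count, len(self.LEVELS) - 1)]

    def install(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGINT, self.handle)
```

The processing loop would poll `level` at safe points and decide there how hard to stop, so all locking stays outside the handler.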
Negatory. Unless you mean that pressing Ctrl-C during geometry reading is intentionally showing
|
Even pressing Ctrl-C later, during Acts initialization, still completes a full event.
|
At least in those cases there are no zombies. |
Yes, that is exactly what I meant. Both geometry and Acts setup are being done during the processing of the first event. Ctrl-C there allows that first event to complete so the event sources, sinks, etc. can all run their Finish methods to clean up. This is what "gracefully" is intended to convey. If you want an immediate stop, you'll have to send a SIGKILL, which we cannot catch. Fixing the deadlock problem should allow this to work as intended and closer to your expectation. This does mean a few extra seconds if it occurs during startup. It may also take a couple of seconds for all threads to drain if running on a many-core system with event processing on the order of 1 sec/thread. However, this ability to signal early stopping with an intact ROOT file by hitting Ctrl-C has proven so useful in the past that it is hard to imagine not having it. You will become a fan. |
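The "finish the current event, then run the Finish methods" behaviour can be sketched in a few lines of Python. This is only an illustration of the pattern under stated assumptions: `process_events`, `process` and `finish` are hypothetical names, and a `threading.Event` stands in for whatever flag the real signal handler sets.

```python
import threading


def process_events(events, stop, process, finish):
    """Process events until `stop` is set, then always run `finish`.

    Hypothetical sketch: `process` handles one event, `finish` stands in for
    the cleanup that closes the output file.
    """
    done = 0
    try:
        for event in events:
            if stop.is_set():
                break        # do not start another event
            process(event)   # an event that has started always completes
            done += 1
    finally:
        finish()             # runs even after an early stop
    return done
```

Because the flag is only checked between events, the latency after Ctrl-C is bounded by the time one event takes, which is why a slow first event (geometry and Acts setup) delays the exit.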
I'm familiar with the concept of trapping signals and cleaning up resources. I just think this usually is a bit faster than the 10 seconds it is here. I don't think users will expect it to complete the current event. |
I'm not so sure. This is what ddsim does. If you hit Ctrl-C while it is reading in the geometry, it continues to read it and doesn't exit until it hits |
Again, I mean this also AFTER geometry has been read in. On my system (2 year old laptop, cvmfs eic-shell) it takes 5 seconds from Ctrl-C to command line AFTER the geometry read-in has been completed. Anyway, that's clearly less important than not having zombies and deadlocks. |
Well, we could address this specifically in EICrecon by adding lines just before long initializations are started like:
Doing this just before initializing Acts could save a little time. |
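The lines proposed in that comment did not survive in this copy of the thread, so the following is not them. It is a generic Python sketch of the same idea, checking a quit flag just before each slow initialization step; `initialize`, the step names and the `threading.Event` flag are all assumptions for illustration.

```python
import threading


def initialize(steps, stop):
    """Run (name, callable) steps in order and skip the rest once `stop` is set.

    Hypothetical sketch: `stop` stands in for the framework's quit flag.
    """
    completed = []
    for name, step in steps:
        if stop.is_set():
            break            # bail out before starting another slow step
        step()
        completed.append(name)
    return completed
```

A step that is already running still finishes; the check only saves the time of the steps that have not started yet.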
Hi everyone! Sorry to jump in to this thread late. Some thoughts:
|
The zombie |
Environment: (where does this bug occur, have you tried other environments)
Branch (main for latest released): main
Revision (HEAD for the most recent on git): 0.3.4
Steps to reproduce: (give a step by step account of how to trigger the bug)
1. Inside eic-shell: source /opt/detector/setup.sh
2. ddsim --compactFile $DETECTOR_PATH/$DETECTOR_CONFIG.xml -G -N 100 --outputFile ddsim.edm4hep.root
3. run_eicrecon_reco_flags.py ddsim.edm4hep.root eicrecon
4. Send SIGINT (Ctrl-C) after [INFO] Starting processing with 1 threads requested... has been printed.
5. ps -f
Alternative:
3. eicrecon ddsim.edm4hep.root -Ppodio:output=eicrecon.tree.edm4hep.root
Expected Result: (what do you expect when you execute the steps above)
When receiving SIGINT, eicrecon should clean up and return to the calling shell only when clean-up has completed, especially when it prints Exiting gracefully... to the screen.
Actual Result: (what do you get when you execute the steps above)
ps -f returns zombie processes. These zombie processes, lscpu and eicrecon, do not end unless explicitly killed.
When running the alternative with eicrecon directly, we don't get a shell back; it is just stuck at Exiting gracefully..., requiring a SIGQUIT.
After SIGQUIT no processes remain.
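For wrapper scripts, a defensive pattern is to interrupt the child, give it a bounded grace period, escalate to SIGKILL if needed, and always call `wait()` so the child is reaped and cannot linger as a zombie. The Python sketch below shows that pattern under stated assumptions: `interrupt_and_reap` is a hypothetical helper, not code from run_eicrecon_reco_flags.py, and the 10 second default is only a placeholder.

```python
import signal
import subprocess


def interrupt_and_reap(proc, grace_seconds=10.0):
    """Send SIGINT, wait up to `grace_seconds`, then SIGKILL; always reap.

    Hypothetical helper for a wrapper script. `proc` is a subprocess.Popen.
    Returns the child's return code (negative signal number if it was killed).
    """
    proc.send_signal(signal.SIGINT)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()          # SIGKILL cannot be caught or ignored
        return proc.wait()   # wait() reaps the child, so no zombie is left
```

A wrapper could call this from its own SIGINT handling path, for example `interrupt_and_reap(subprocess.Popen(["eicrecon", "ddsim.edm4hep.root"]))`, so that a child stuck at "Exiting gracefully..." is still cleaned up after the grace period.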
Additional context
In CI tests and on OSG I see simulation production jobs timing out that shouldn't time out. On CI, we run 100 events, so they should not time out at 6 hours unless something has gone wrong and a process is hanging. On OSG, we target 2-hour run times, and jobs should never reach the 20-hour timeout mark. Yet that is what is happening. I worry that these processes may be interrupted for some reason (maybe a different signal, maybe SIGSEGV?), but they don't clean up and the job submission system never sees them end.