
run_eicrecon and eicrecon do not handle SIGINT well (leave zombie in case of run_eicrecon) #310

Closed
wdconinc opened this issue Nov 6, 2022 · 11 comments
Labels
bug Something isn't working


@wdconinc
Contributor

wdconinc commented Nov 6, 2022

Environment: (where does this bug occur, have you tried other environments)

  • Which branch (often main for latest released): main
  • Which version (or HEAD for the most recent on git): 0.3.4
  • Any specific OS or system where the issue occurs? eic-shell

Steps to reproduce: (give a step by step account of how to trigger the bug)

Inside eic-shell:

  1. source /opt/detector/setup.sh
  2. ddsim --compactFile $DETECTOR_PATH/$DETECTOR_CONFIG.xml -G -N 100 --outputFile ddsim.edm4hep.root
  3. run_eicrecon_reco_flags.py ddsim.edm4hep.root eicrecon
  4. Ctrl-C during eicrecon running, after [INFO] Starting processing with 1 threads requested... has been printed.
  5. ps -f

Alternative:
3. eicrecon ddsim.edm4hep.root -Ppodio:output=eicrecon.tree.edm4hep.root

Expected Result: (what do you expect when you execute the steps above)

When receiving SIGINT, eicrecon should clean up and return to the calling shell only when clean-up has completed. Especially since it prints Exiting gracefully... to the screen.

Actual Result: (what do you get when you execute the steps above)

ps -f returns

UID          PID    PPID  C STIME TTY          TIME CMD
wdconinc   37379   36079  0 16:51 pts/2    00:00:00 /bin/bash
wdconinc   37670   37379  0 16:53 pts/2    00:00:00 /bin/bash /home/wdconinc/bin/eic-shell
wdconinc   37674   37670  0 16:53 pts/2    00:00:00 Apptainer runtime parent
wdconinc   37848   37674  0 16:53 pts/2    00:00:00 bash --norc --noprofile
wdconinc   46519       1  0 19:08 pts/2    00:00:00 eicrecon -Pplugins=dump_flags -Pdump_flags:python=all_flags_dump_from_run.py -Pjana:debug_plugin_loading=1 -Pjana:nevents=0 -Pacts:MaterialMap=calibrations/materials-map.cbor -Ppodio:o
wdconinc   46520   46519  0 19:08 pts/2    00:00:00 [lscpu] <defunct>
wdconinc   46576   37848  0 19:09 pts/2    00:00:00 ps -f

These processes do not end unless explicitly killed: lscpu is left as a zombie (<defunct>), and eicrecon, now reparented to PID 1, keeps running.

When running the alternative with eicrecon directly, we don't get back to the shell; the process is stuck at Exiting gracefully... and requires a SIGQUIT:

[INFO] Starting processing with 1 threads requested...
[INFO] JArrowProcessingController: run(): Launching 1 workers
[INFO] JArrow: Initializing JEventProcessor 'JEventProcessorPODIO'
[INFO] JArrow: Initializing JEventSource 'ddsim.edm4hep.root' (JEventSourcePODIOsimple)
^C[FATAL] Exiting gracefully...
^\Quit

After SIGQUIT no processes remain.

Additional context

In CI tests and on OSG I see simulation production jobs timing out that shouldn't time out. On CI, we run 100 events, so they should not time out at 6 hours unless something has gone wrong and a process is hanging. On OSG, we target 2-hour run times, and jobs should never reach the 20-hour timeout mark. Yet that is what is happening. I worry that these processes may be interrupted for some reason (maybe a different signal, maybe SIGSEGV?), but they don't clean up and the job submission system never sees them end.

@wdconinc wdconinc added the bug Something isn't working label Nov 6, 2022
@faustus123
Contributor

I was able to reproduce this. The problem only occurs if the signal is sent during start-up. If you wait until the geometry starts to load, then it will exit as expected.

A bit of context:

The "Exiting Gracefully ..." message means JANA will allow any events currently being processed to finish before it stops. The main use case is a user who wants to end processing early but cleanly close the ROOT file, so that the processing done up to that point is saved. On the command line, one can issue a second and even a third Ctrl-C and it will attempt more aggressive termination. For the case the OP describes, though, this still does not help (see below).

A copy of the full stack trace taken when the program is in the stalled state is given below. This does point to a bug in JANA. Specifically: a mutex is locked while the JEventSource is run from JArrowTopology::run. Because the signal happens right when the source is initializing, the mutex is still held when the signal handler calls JArrowTopology::drain(). Hence the deadlock. This also explains why it behaves correctly if the signal comes after the event source is initialized.

I'm looping Nathan in since he will likely want to make the fix. The problem actually gets a bit complicated when you start thinking about how to properly handle all cases (e.g. honor the mutex during regular event processing, but not during startup?). This really should be fixed in JANA itself as opposed to trying to address it in EICrecon.

(gdb) bt
#0 0x00007ff126b7654d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007ff126b71e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007ff126b71d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00000000004198d7 in __gthread_mutex_lock (__mutex=0x1bbe5d8)
at /opt/rh/devtoolset-9/root/usr/include/c++/9/x86_64-redhat-linux/bits/gthr-default.h:749
#4 0x000000000041a474 in std::mutex::lock (this=0x1bbe5d8) at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/std_mutex.h:100
#5 0x000000000041cfb0 in std::lock_guard<std::mutex>::lock_guard (this=0x7fff4ba021f8, __m=...)
at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/std_mutex.h:159
#6 0x00000000004536f9 in JArrowTopology::drain() ()
#7 0x000000000044cf00 in JArrowProcessingController::request_stop() ()
#8 0x000000000042c034 in JApplication::Stop(bool) ()
#9 0x000000000042c0bb in JApplication::Quit(bool) ()
#10 0x0000000000415246 in JSignalHandler::handle_sigint ()
at /home/davidl/work/2022.07.11.EICrecon/EICTOP/JANA/v2.0.7/include/JANA/CLI/JSignalHandler.h:120
#11
#12 0x00007ff119d09dbc in clang::ASTReader::getGlobalDeclID(clang::serialization::ModuleFile&, unsigned int) const [clone .part.0] ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#13 0x00007ff119d4a805 in clang::ASTDeclReader::VisitDecl(clang::Decl*) ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#14 0x00007ff119d4b7c1 in clang::ASTDeclReader::VisitNamedDecl(clang::NamedDecl*) ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#15 0x00007ff119d4b981 in clang::ASTDeclReader::VisitValueDecl(clang::ValueDecl*) ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#16 0x00007ff119d4ba19 in clang::ASTDeclReader::VisitDeclaratorDecl(clang::DeclaratorDecl*) ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#17 0x00007ff119d67c81 in clang::ASTDeclReader::VisitVarDeclImpl(clang::VarDecl*) ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#18 0x00007ff119d6860f in clang::ASTDeclReader::VisitParmVarDecl(clang::ParmVarDecl*) ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#19 0x00007ff119d6f352 in clang::ASTDeclReader::Visit(clang::Decl*) ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#20 0x00007ff119d6fe0c in clang::ASTReader::ReadDeclRecord(unsigned int) ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#21 0x00007ff119cd8009 in clang::ASTReader::GetDecl(unsigned int) ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#22 0x00007ff119d6c32d in clang::ASTDeclReader::VisitFunctionDecl(clang::FunctionDecl*) ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#23 0x00007ff119d6f352 in clang::ASTDeclReader::Visit(clang::Decl*) ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#24 0x00007ff119d6fe0c in clang::ASTReader::ReadDeclRecord(unsigned int) ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#25 0x00007ff119cd8009 in clang::ASTReader::GetDecl(unsigned int) ()
from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#26 0x00007ff119d0b272 in clang::ASTReader::FindExternalLexicalDecls(clang::DeclContext const*, llvm::function_ref<bool (clang::Decl::Kind)>, llvm::SmallVectorImpl<clang::Decl*>&) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#27 0x00007ff119e0e0ae in clang::MultiplexExternalSemaSource::FindExternalLexicalDecls(clang::DeclContext const*, llvm::function_ref<bool (clang::Decl::Kind)>, llvm::SmallVectorImpl<clang::Decl*>&) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#28 0x00007ff11ba23511 in clang::DeclContext::LoadLexicalDeclsFromExternalStorage() const () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#29 0x00007ff11ba235b5 in clang::DeclContext::decls_begin() const () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#30 0x00007ff119426634 in TCling::LoadEnums(TListOfEnums&) const () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#31 0x00007ff12864f941 in TListOfEnums::Load() () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCore.so
#32 0x00007ff1286262b2 in TClass::GetListOfEnums(bool) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCore.so
#33 0x00007ff128646f1f in TEnum::GetEnum(char const*, TEnum::ESearchAction) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCore.so
#34 0x00007ff128631de6 in TClass::GetClass(char const*, bool, bool, unsigned long, unsigned long) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCore.so
#35 0x00007ff1286470a5 in TEnum::GetEnum(char const*, TEnum::ESearchAction) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCore.so
#36 0x00007ff119444314 in TCling::CheckClassInfo(char const*, bool, bool) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#37 0x00007ff128632132 in TClass::GetClass(char const*, bool, bool, unsigned long, unsigned long) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCore.so
#38 0x00007ff1286383e3 in TBuildRealData::Inspect(TClass*, char const*, char const*, void const*, bool) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCore.so
#39 0x00007ff119440456 in TCling::InspectMembers(TMemberInspector&, void const*, TClass const*, bool) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#40 0x00007ff12862904d in TClass::CallShowMembers(void const*, TMemberInspector&, bool) const () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCore.so
#41 0x00007ff128636e09 in TClass::BuildRealData(void*, bool) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCore.so
#42 0x00007ff128638474 in TBuildRealData::Inspect(TClass*, char const*, char const*, void const*, bool) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCore.so
#43 0x00007ff119440456 in TCling::InspectMembers(TMemberInspector&, void const*, TClass const*, bool) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCling.so
#44 0x00007ff12862904d in TClass::CallShowMembers(void const*, TMemberInspector&, bool) const () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCore.so
#45 0x00007ff128636e09 in TClass::BuildRealData(void*, bool) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libCore.so
#46 0x00007ff1280c1d4d in TBufferFile::ReadClassBuffer(TClass const*, void*, int, unsigned int, unsigned int, TClass const*) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libRIO.so
#47 0x00007ff12785227c in TTree::Streamer(TBuffer&) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libTree.so
#48 0x00007ff128152d1e in TKey::ReadObj() () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libRIO.so
#49 0x00007ff1281199eb in TDirectoryFile::Get(char const*) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libRIO.so
#50 0x00007ff12780949d in TChain::LoadTree(long long) () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libTree.so
#51 0x00007ff12780858f in TChain::GetFile() const () from /home/davidl/work/2022.07.11.EICrecon/EICTOP/root/root-6.26.04/lib/libTree.so
#52 0x00007ff126f955ad in podio::ROOTReader::openFiles (this=0x1bdd2b8, filenames=...) at /home/davidl/work/2022.07.11.EICrecon/EICTOP/PODIO/v00-15/src/ROOTReader.cc:152
#53 0x00007ff126f95d3f in podio::ROOTReader::openFile (this=0x1bdd2b8, filename=...) at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/stl_uninitialized.h:82
#54 0x00007ff11d375b23 in JEventSourcePODIOsimple::Open (this=0x1bdd230) at /home/davidl/work/2022.09.02.EICrecon/EICrecon/src/services/io/podio/JEventSourcePODIOsimple.cc:138
#55 0x00007ff11d3064a5 in std::__invoke_impl<void, void (JEventSource::*)(), JEventSource*> (__f=@0x7fff4ba066f0: &virtual JEventSource::Open(), __t=@0x7fff4ba066e8: 0x1bdd230) at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/invoke.h:73
#56 0x00007ff11d2fac78 in std::__invoke<void (JEventSource::*)(), JEventSource*> (__fn=@0x7fff4ba066f0: &virtual JEventSource::Open()) at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/invoke.h:95
#57 0x00007ff11d2f2698 in std::call_once<void (JEventSource::*)(), JEventSource*>(std::once_flag&, void (JEventSource::*&&)(), JEventSource*&&)::{lambda()#1}::operator()() const (this=0x7fff4ba06650) at /opt/rh/devtoolset-9/root/usr/include/c++/9/mutex:671
#58 0x00007ff11d2f26c7 in std::call_once<void (JEventSource::*)(), JEventSource*>(std::once_flag&, void (JEventSource::*&&)(), JEventSource*&&)::{lambda()#2}::operator()() const (this=0x0) at /opt/rh/devtoolset-9/root/usr/include/c++/9/mutex:676
#59 0x00007ff11d2f26d8 in std::call_once<void (JEventSource::*)(), JEventSource*>(std::once_flag&, void (JEventSource::*&&)(), JEventSource*&&)::{lambda()#2}::_FUN() () at /opt/rh/devtoolset-9/root/usr/include/c++/9/mutex:676
#60 0x00007ff126b6e20b in __pthread_once_slow () from /lib64/libpthread.so.0
#61 0x00007ff11d2e953e in __gthread_once (__once=0x1bdd280, __func=0x7ff1269148c0 <__once_proxy>) at /opt/rh/devtoolset-9/root/usr/include/c++/9/x86_64-redhat-linux/bits/gthr-default.h:700
#62 0x00007ff11d2f276a in std::call_once<void (JEventSource::*)(), JEventSource*> (__once=..., __f=@0x7fff4ba066f0: &virtual JEventSource::Open()) at /opt/rh/devtoolset-9/root/usr/include/c++/9/mutex:683
#63 0x00007ff11d2f19b6 in JEventSource::DoInitialize (this=0x1bdd230) at /home/davidl/work/2022.07.11.EICrecon/EICTOP/JANA/v2.0.7/include/JANA/JEventSource.h:112
#64 0x000000000045dc2a in JEventSourceArrow::initialize() ()
#65 0x0000000000454b3c in JArrow::run() ()
#66 0x0000000000453c70 in JArrowTopology::run(int) ()
#67 0x000000000044c7ae in JArrowProcessingController::run(unsigned long) ()
#68 0x000000000042b6ad in JApplication::Run(bool) ()
#69 0x00000000004183ad in jana::Execute (app=0x1b0dea0, options=...) at /home/davidl/work/2022.09.02.EICrecon/EICrecon/src/utilities/eicrecon/eicrecon_cli.cpp:381
#70 0x00000000004125b5 in main (narg=3, argv=0x7fff4ba08428) at /home/davidl/work/2022.09.02.EICrecon/EICrecon/src/utilities/eicrecon/eicrecon.cc:54
(gdb)

@wdconinc
Contributor Author

wdconinc commented Nov 7, 2022

Negatory.

Unless you mean that pressing Ctrl-C during geometry reading is intentionally showing Exiting gracefully... while intentionally continuing to run reconstruction on events:

[2022-11-06 22:01:43.542] [JEventProcessorPODIO] [info] Writing collection 'ZDCEcalRecHits'
[2022-11-06 22:01:43.542] [JEventProcessorPODIO] [info] Writing collection 'ZDCEcalTruthClusters'
[2022-11-06 22:01:43.542] [JEventProcessorPODIO] [info] Writing collection 'trackerHits'
[2022-11-06 22:01:43.543] [B0ECalIslandProtoClusters] [info] Clustering uses dimScaledLocalDistXY with distances <= [1.8,1.8]
PersistencyIO    INFO  +++ Set Streamer to dd4hep::OpaqueDataBlock
Info in <TGeoManager::TGeoManager>: Geometry default, Detector Geometry created
Info in <TGeoNavigator::BuildCache>: --- Maximum geometry depth set to 100
[INFO] Loading DD4hep geometry from 1 files
[INFO]   - loading geometry file:  '/opt/detector/epic-nightly/share/epic/epic.xml' (patience ....)
Info in <TGeoManager::SetTopVolume>: Top volume is world_volume. Master volume is world_volume
^C[FATAL] Exiting gracefully...
Info in <TGeoManager::CheckGeometry>: Fixing runtime shapes...
Info in <TGeoManager::CheckGeometry>: ...Nothing to fix
Info in <TGeoManager::CloseGeometry>: Counting nodes...
Info in <TGeoManager::Voxelize>: Voxelizing...
Info in <TGeoManager::CloseGeometry>: Building cache...
Info in <TGeoManager::CountLevels>: max level = 7, max placements = 26163
Info in <TGeoManager::CloseGeometry>: 530448 nodes/ 18123 volume UID's in Detector Geometry
Info in <TGeoManager::CloseGeometry>: ----------------modeler ready----------------
[INFO] Geometry successfully loaded.
[2022-11-06 22:02:01.106] [B0ECalRecHits] [info] Find sector field sector, index = 1
[2022-11-06 22:02:01.111] [JEventProcessorPODIO] [warning] Exception adding PODIO type 'edm4eic::Cluster:EcalBarrelImagingClusters': Could not find JFactoryT<edm4hep::SimCalorimeterHit> with tag=EcalBarrelImagingHits.
[2022-11-06 22:02:01.113] [JEventProcessorPODIO] [warning] Exception adding PODIO type 'edm4eic::Cluster:EcalBarrelImagingMergedClusters': Could not find JFactoryT<edm4hep::SimCalorimeterHit> with tag=EcalBarrelScFiHits.
[2022-11-06 22:02:01.115] [JEventProcessorPODIO] [warning] Exception adding PODIO type 'edm4hep::RawCalorimeterHit:EcalBarrelImagingRawHits': Could not find JFactoryT<edm4hep::SimCalorimeterHit> with tag=EcalBarrelImagingHits.
...
[INFO] Status: 1 events processed  0.0 mHz (0.0 mHz avg)
[INFO] Merging threads ...
[INFO] Event processing ended.
[INFO] JArrowProcessingController: Final Report
  Thread team size [count]:    1
  Total uptime [s]:            22.06
  Uptime delta [s]:            21.56
  Completed events [count]:    1
  Inst throughput [Hz]:        0.0464
  Avg throughput [Hz]:         0.0453
  Sequential bottleneck [Hz]:  469
  Parallel bottleneck [Hz]:    0.0453
  Efficiency [0..1]:           1

@wdconinc
Contributor Author

wdconinc commented Nov 7, 2022

Even pressing Ctrl-C later, during Acts initialization, still completes a full event.

[2022-11-06 22:05:10.916] [acts] [info] Acts GENERAL log level is set to info (2)
[2022-11-06 22:05:10.916] [acts_init] [info] Acts INIT log level is set to info (2)
[2022-11-06 22:05:10.916] [acts_init] [info] loading materials map from file: 'calibrations/materials-map.cbor'
[INFO] Status: 0 events processed  0.0 Hz (0.0 Hz avg)
^C[FATAL] Exiting gracefully...
[2022-11-06 22:05:11.439] [acts_init] [info] Converting DD4Hep geometry to ACTS...
22:05:11    DD4hepConver   INFO      Translating DD4hep geometry into Acts geometry
22:05:11    DD4hepConver   INFO      Translating DD4hep sub detector: acts_beampipe_central
22:05:11    DD4hepConver   INFO      Translating DD4hep sub detector: VertexBarrelSubAssembly
22:05:11    DD4hepConver   INFO      Translating DD4hep sub detector: InnerSiTrackerSubAssembly
22:05:11    DD4hepConver   INFO      Translating DD4hep sub detector: MiddleSiTrackerSubAssembly
22:05:11    DD4hepConver   INFO      Translating DD4hep sub detector: OuterSiTrackerSubAssembly
22:05:11    DD4hepConver   INFO      Translating DD4hep sub detector: InnerMPGDSubAssembly
22:05:11    DD4hepConver   INFO      Translating DD4hep sub detector: BarrelTOFSubAssembly
22:05:11    DD4hepConver   INFO      Translating DD4hep sub detector: EndcapTOFSubSubAssembly
[2022-11-06 22:05:14.014] [acts_init] [info] DD4Hep geometry converted!
[2022-11-06 22:05:14.014] [acts_init] [info] Checking surfaces...
[2022-11-06 22:05:14.049] [acts_init] [info] Loading magnetic field...
[2022-11-06 22:05:14.049] [acts_init] [info] ActsGeometryProvider initialization complete
[2022-11-06 22:05:14.050] [Digi:SmearedFarForwardParticles] [warning] No incoming ion beam; using 100 GeV ion beam energy.
[2022-11-06 22:05:14.050] [ZDCEcalIslandProtoClusters] [info] Clustering uses dimScaledLocalDistXY with distances <= [5,5]
[2022-11-06 22:05:14.050] [ZDCEcalRecHits] [warning] Failed to find field index for ZDCEcalHits.
[2022-11-06 22:05:14.050] [ZDCEcalRecHits] [warning]  -- looking for sector field "sector".
[2022-11-06 22:05:14.050] [ZDCEcalRecHits] [warning] Available fields: "system", "module", "fiber_x", "fiber_y", 
[2022-11-06 22:05:14.050] [ZDCEcalRecHits] [warning] n.b. The local position, sector id and layer id will not be correct for this.
[2022-11-06 22:05:14.050] [ZDCEcalRecHits] [warning] however, the position, energy, and time values should still be good.
[INFO] JArrow: Finalizing JEventSource 'ddsim.edm4hep.root' (JEventSourcePODIOsimple)
[INFO] JArrow: Finalizing JEventProcessor 'JEventProcessorPODIO'
[INFO] Status: 1 events processed  0.2 Hz (0.0 mHz avg)
[INFO] Merging threads ...
[INFO] Event processing ended.

@wdconinc
Contributor Author

wdconinc commented Nov 7, 2022

At least in those cases there are no zombies.

@faustus123
Contributor

Yes, that is exactly what I meant. Both geometry and Acts setup are done during the processing of the first event. Ctrl-C there allows that first event to complete so the event sources, sinks, etc. can all run their Finish methods to clean up. This is what "gracefully" is intended to convey. If you want an immediate stop, you'll have to send a SIGKILL, which we cannot catch.

Fixing the deadlock problem should allow this to work as intended and closer to your expectation. This does mean a few extra seconds if it occurs during startup. It may also take a couple of seconds for all threads to drain when running on a many-core system with event processing on the order of 1 sec/thread. However, the ability to signal early stopping with an intact ROOT file by hitting Ctrl-C has proven so useful in the past that it is hard to imagine not having it. You will become a fan.

@wdconinc
Contributor Author

wdconinc commented Nov 7, 2022

You will become a fan.

I'm familiar with the concept of trapping signals and cleaning up resources. I just think this is usually a bit faster than the 10 seconds it takes here. I don't think users will expect it to complete the current event.

@faustus123
Contributor

I'm not so sure. This is what ddsim does. If you hit Ctrl-C while it is reading in the geometry, it continues to read it and doesn't exit until it hits RUNNER.run() after the geometry has been completely read in. This takes several seconds as well.

@wdconinc
Contributor Author

wdconinc commented Nov 7, 2022

Again, I mean this also AFTER geometry has been read in. On my system (2 year old laptop, cvmfs eic-shell) it takes 5 seconds from Ctrl-C to command line AFTER the geometry read-in has been completed.

Anyway, that's clearly less important than not having zombies and deadlocks.

@faustus123
Contributor

Well, we could address this specifically in EICrecon by adding lines just before long initializations are started like:

if( GetApplication()->IsQuitting() ) return;

Doing this just before initializing Acts could save a little time.

@nathanwbrei
Contributor

Hi everyone! Sorry to jump in to this thread late. Some thoughts:

  • I'll prioritize fixing the deadlock and the zombies at the JANA level.

  • The expected behavior in JANA is that one Ctrl-C does a graceful shutdown that finishes all events currently in flight, a second Ctrl-C attempts to close the JEventProcessors/output files without waiting for the in-flight events to finish, and a third does an immediate exit. There is currently a known issue where the segfault handler attempts a graceful shutdown rather than hard exit()ing, which leads to a different deadlock; that will also be fixed in the next release. (This issue can mess up the handler for pressing Ctrl-C twice, because right now we aren't attempting to kill the worker threads before closing the ROOT file.)

  • Offhand I don't think there's anything we should change about this behavior, other than to make it not broken. If you are running JANA interactively, there's nothing stopping you from pressing Ctrl-C two extra times if it is taking too long to shut down. If you are running on a cluster, just send it a SIGKILL (or SIGTERM, or SIGQUIT) instead. Maybe there's an argument to be made for getting rid of the second SIGINT handler, so that pressing Ctrl-C once does the graceful exit and twice does the hard exit.

@nathanwbrei
Contributor

The zombie lscpu child process is fixed and will make it into the next JANA release.
