Debugging Octo-Tiger on Grace Hopper #496

Open
diehlpk opened this issue Sep 3, 2024 · 7 comments

@diehlpk
Member

diehlpk commented Sep 3, 2024

I just pushed a branch called verbose_debug. To enable the debugging output, set --verbose=1; to disable it, set --verbose=0. I've attached an example of the output. It prints a comment at the beginning and end of each instrumented function, along with the start time and the time elapsed during the execution of the function. When a comment contains something like "(from root)", the code is inside a function that executes for each node, and only the root node is emitting output.
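
For readers who have not looked at the branch, paired BEGIN/END lines with an elapsed time could come from a scoped helper along these lines. This is a minimal sketch only: the `scoped_trace` class and the `trace_demo` function are hypothetical names and are not taken from the verbose_debug branch, whose real output also carries file, line, and function information.

```cpp
#include <cassert>
#include <chrono>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical RAII helper (an assumption, not the code on the branch):
// prints BEGIN on construction and END plus elapsed seconds on destruction.
class scoped_trace {
public:
    scoped_trace(std::ostream& out, std::string label, bool enabled)
      : out_(out), label_(std::move(label)), enabled_(enabled),
        start_(std::chrono::steady_clock::now()) {
        if (enabled_) out_ << "BEGIN: " << label_ << '\n';
    }
    ~scoped_trace() {
        if (!enabled_) return;
        std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - start_;
        out_ << "END  : " << label_ << " (" << dt.count() << " s elapsed)\n";
    }
    scoped_trace(const scoped_trace&) = delete;
    scoped_trace& operator=(const scoped_trace&) = delete;

private:
    std::ostream& out_;
    std::string label_;
    bool enabled_;
    std::chrono::steady_clock::time_point start_;
};

// Traces two nested scopes and returns whatever was written.
// The `verbose` argument plays the role of --verbose=1 / --verbose=0.
std::string trace_demo(bool verbose) {
    std::ostringstream log;
    {
        scoped_trace outer(log, "regrid", verbose);
        {
            scoped_trace inner(log, "gather", verbose);
        }  // inner END is written here
    }      // outer END is written here
    // Disabled tracing must stay silent.
    assert(verbose || log.str().empty());
    return log.str();
}
```

Calling `trace_demo(true)` yields a BEGIN line for each scope followed by the matching END lines in reverse order. If a function never returns, its END line is never written, which is what makes this style of output useful for locating a hang.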

@diehlpk
Member Author

diehlpk commented Sep 3, 2024

@dmarce1 I tried to compile the new branch and got the following error:

2 errors found in build log:
     123    -- Octo-Tiger will use Kokkos Serial Execution Space for (Kokkos CPU) Hydro kernels!
     124    INFO Building with fp_contract=off
     125    -- Octo-Tiger max nf: 15
     126    -- Octo-Tiger minimal allowed theta: 0.34
     127    INFO Used Octo-Tiger commit: 02cf56d9bc2b4852022886f5cff6a39bb7438a07
     128    -- Configuring done
  >> 129    CMake Error at /users/diehlpk/spack/opt/spack/linux-sles15-neoverse_v2/gcc-12.3.0/hpx-1.9.1-4e54quutjtm4nz
            4y447r5kanti3odvn6/lib64/cmake/HPX/HPX_AddLibrary.cmake:235 (add_library):
     130      Cannot find source file:
     131    
     132        octotiger/verbose.hpp
     133    
     134      Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .h .hh .h++
     135      .hm .hpp .hxx .in .txx .f .F .for .f77 .f90 .f95 .f03 .ispc
     136    Call Stack (most recent call first):
     137      CMakeLists.txt:349 (add_hpx_library)
     138    
     139    
  >> 140    CMake Error at /users/diehlpk/spack/opt/spack/linux-sles15-neoverse_v2/gcc-12.3.0/hpx-1.9.1-4e54quutjtm4nz
            4y447r5kanti3odvn6/lib64/cmake/HPX/HPX_AddLibrary.cmake:235 (add_library):
     141      No SOURCES given to target: octolib
     142    Call Stack (most recent call first):
     143      CMakeLists.txt:349 (add_hpx_library)
     144    
     145    
     146    CMake Generate step failed.  Build files cannot be regenerated correctly.

@diehlpk
Member Author

diehlpk commented Sep 3, 2024

cc @G-071 and @JiakunYan

@diehlpk
Member Author

diehlpk commented Sep 19, 2024

The code hangs here:

New Omega = 9.687093e-01
t=21 END  : DWD step (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 396, function: execute_solver) (2.939855e+00 s elapsed)
TS 16:: t: 2.423595e+02, dt: 1.655967e-03, time_elapsed: 3.066656e+00, rotational_time: 2.347759e+02, x: 1.492980e+00, y: -4.062294e+00, z: -3.587781e-01, a: 3.727446e+00, ur: 2.053696e-06, ul: 2.037199e-06, vr: 6.826120e-01, vl: 6.800910e-01, dim: 0, ngrids: 8393, leafs: 7344, amr_boundaries: 5960
t=21 BEGIN: regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 458, function: execute_solver)
-----------------------------------------------
t=21 BEGIN: check for refinement (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 251, function: regrid)
t=21 END  : check for refinement (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 251, function: regrid) (3.230400e-02 s elapsed)
t=21 BEGIN: regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 259, function: regrid)
t=21 BEGIN: gather (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 261, function: regrid)
          (rebalancing 8489 nodes with 7428 leaves)
t=21 END  : gather (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 261, function: regrid) (4.129000e-03 s elapsed)
t=21 BEGIN: scatter (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 266, function: regrid)
t=21 END  : scatter (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 266, function: regrid) (1.547000e-02 s elapsed)
t=21 BEGIN: form tree connections (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 271, function: regrid)
          (6072 amr boundaries)
t=21 END  : form tree connections (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 271, function: regrid) (1.114730e-01 s elapsed)
t=21 BEGIN: solve gravity (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 276, function: regrid)
t=21 BEGIN: (root node) computing FMM (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 329, function: solve_gravity)
t=22 END  : (root node) computing FMM (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 329, function: solve_gravity) (3.010200e-02 s elapsed)
t=22 END  : solve gravity (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 276, function: regrid) (7.064400e-02 s elapsed)
t=22 END  : regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 259, function: regrid) (2.020480e-01 s elapsed)
t=22 END  : regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 458, function: execute_solver) (2.345370e-01 s elapsed)
t=22 END  : main execution loop iteration (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 363, function: execute_solver) (3.301262e+00 s elapsed)
t=22 BEGIN: main execution loop iteration (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 363, function: execute_solver)
t=22 BEGIN: DWD step (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 396, function: execute_solver)
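
A log like this can be checked mechanically for scopes that were opened but never closed. The following is a minimal sketch, assuming the output comes from a single stream and that BEGIN/END lines are properly nested; `scope_key` and `open_scopes` are hypothetical helper names, not part of Octo-Tiger.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns "label (file: ..., line: ..., function: ...)" from the text that
// follows a BEGIN/END marker, dropping the trailing "(... s elapsed)" part.
static std::string scope_key(const std::string& line, std::size_t after_marker) {
    assert(after_marker <= line.size());
    std::string key = line.substr(after_marker);
    std::size_t file_pos = key.find(" (file:");
    if (file_pos != std::string::npos) {
        std::size_t close = key.find(')', file_pos);
        if (close != std::string::npos) key.erase(close + 1);
    }
    return key;
}

// Returns the scopes that have a BEGIN line but no matching END line,
// outermost first. The last entry is the innermost scope still running.
std::vector<std::string> open_scopes(const std::vector<std::string>& lines) {
    static const std::string begin_marker = "BEGIN: ";
    static const std::string end_marker = "END  : ";
    std::vector<std::string> stack;
    for (const std::string& line : lines) {
        std::size_t b = line.find(begin_marker);
        std::size_t e = line.find(end_marker);
        if (b != std::string::npos) {
            stack.push_back(scope_key(line, b + begin_marker.size()));
        } else if (e != std::string::npos) {
            std::string key = scope_key(line, e + end_marker.size());
            // An END whose BEGIN lies before the excerpt is ignored.
            if (!stack.empty() && stack.back() == key) stack.pop_back();
        }
    }
    return stack;
}
```

Applied to the excerpt above, the scopes left open are "main execution loop iteration" and, innermost, "DWD step".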

@dmarce1
Member

dmarce1 commented Sep 20, 2024 via email

@diehlpk
Member Author

diehlpk commented Sep 25, 2024

Here is the new output:

diagnostics...
New Omega = 9.687093e-01
t=25 END  : DWD step (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 396, function: execute_solver) (2.890631e+00 s elapsed)
TS 16:: t: 2.423595e+02, dt: 1.655967e-03, time_elapsed: 3.017211e+00, rotational_time: 2.347759e+02, x: 1.492980e+00, y: -4.062294e+00, z: -3.587781e-01, a: 3.727446e+00, ur: 2.053696e-06, ul: 2.037199e-06, vr: 6.826120e-01, vl: 6.800910e-01, dim: 0, ngrids: 8393, leafs: 7344, amr_boundaries: 5960
t=25 BEGIN: regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 458, function: execute_solver)
-----------------------------------------------
t=25 BEGIN: check for refinement (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 251, function: regrid)
t=25 END  : check for refinement (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 251, function: regrid) (3.727400e-02 s elapsed)
t=25 BEGIN: regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 259, function: regrid)
t=25 BEGIN: gather (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 261, function: regrid)
          (rebalancing 8489 nodes with 7428 leaves)
t=25 END  : gather (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 261, function: regrid) (1.043100e-02 s elapsed)
t=25 BEGIN: scatter (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 266, function: regrid)
t=25 END  : scatter (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 266, function: regrid) (1.588200e-02 s elapsed)
t=25 BEGIN: form tree connections (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 271, function: regrid)
          (6072 amr boundaries)
t=26 END  : form tree connections (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 271, function: regrid) (1.025130e-01 s elapsed)
t=26 BEGIN: solve gravity (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 276, function: regrid)
t=26 BEGIN: (root node) computing FMM (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 329, function: solve_gravity)
t=26 END  : (root node) computing FMM (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 329, function: solve_gravity) (4.937500e-02 s elapsed)
t=26 END  : solve gravity (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 276, function: regrid) (7.951700e-02 s elapsed)
t=26 END  : regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 259, function: regrid) (2.084010e-01 s elapsed)
t=26 END  : regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 458, function: execute_solver) (2.457240e-01 s elapsed)
t=26 END  : main execution loop iteration (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 363, function: execute_solver) (3.262951e+00 s elapsed)
t=26 BEGIN: main execution loop iteration (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 363, function: execute_solver)

@dmarce1
Member

dmarce1 commented Sep 26, 2024

Is this running without SILO output enabled? I think the problem may be SILO-related. If it is being run with SILO output, can you please re-run it with disable_output=on?

@dmarce1
Member

dmarce1 commented Sep 27, 2024

I think I have the bug narrowed down to diagnostics(); it is likely in the section of code from node_server_actions_2.cpp shown below. I have added some more debug output, which will hopefully let us narrow it down further.

EDIT: My bet is that this is in all_hydro_bounds. If so, verbose debugging output may not narrow it down any further than which kind of boundary exchange is involved. There are three kinds: a) the restrict step, which updates refined cells from their children; b) the decomp step, which exchanges ghost cells between grids on the same level; and c) the AMR step, which interpolates ghost cells at AMR boundaries.

diagnostics_t node_server::diagnostics(const diagnostics_t &diags) {
    if (is_refined) {
        // Start the child diagnostics asynchronously.
        auto rc = hpx::async(hpx::annotated_function([&]() {
            return child_diagnostics(diags);
        }, "diagnostics::return_child_diagnostics"));
        // Exchange hydro boundaries while the children are working.
        all_hydro_bounds();
        // Wait for the child result.
        auto diags = GET(rc);
        return diags;
    } else {
        all_hydro_bounds();
        return local_diagnostics(diags);
    }
}
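
In the refined branch there are two places where this function can stall: the local all_hydro_bounds() call and the wait on the child future. One way to tell them apart is to bound the wait on the future. The sketch below uses standard C++ futures as a stand-in for HPX, which is an assumption made purely for illustration; `wait_for_child` is a hypothetical name, and the thread does not establish whether the same approach fits Octo-Tiger's GET macro.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <string>
#include <thread>

// Launches a stand-in for the child work, then waits for its result for at
// most `limit`. A report is returned instead of blocking forever.
std::string wait_for_child(std::chrono::milliseconds child_delay,
                           std::chrono::milliseconds limit) {
    assert(limit.count() >= 0);
    // std::launch::async forces a real thread, so wait_for never reports
    // a deferred task.
    std::future<int> child = std::async(std::launch::async, [child_delay] {
        std::this_thread::sleep_for(child_delay);  // simulated child work
        return 42;
    });

    // The local boundary exchange would run here, overlapping the child work.

    if (child.wait_for(limit) != std::future_status::ready) {
        return "child result not ready after " +
               std::to_string(limit.count()) + " ms";
    }
    return "child returned " + std::to_string(child.get());
}
```

If the report is never produced at all, the stall is in the local work before the wait; if the "not ready" message appears, the child side is the one that has not finished. Note that a future returned by `std::async` still blocks in its destructor until the task completes, so this sketch only reports a slow child and does not abandon it.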
