signal SIGABRT in testing #33

Open
brtnfld opened this issue Mar 1, 2023 · 17 comments


@brtnfld
Collaborator

brtnfld commented Mar 1, 2023

As I'm developing the Fortran async tests in HDF5, I'm seeing an issue with H5Aopen_async_f (backtrace below).

Sometimes the test fails and sometimes it does not. I'm running on 6 ranks.

It is basically doing:


    CALL h5fopen_async_f(filename, H5F_ACC_RDWR_F, file_id, es_id, hdferror, access_prp = fapl_id )
    CALL check("h5fopen_async_f",hdferror, total_error)

    f_ptr = C_LOC(exists0)
    CALL H5Aexists_async_f(file_id, attr_name, f_ptr, es_id, hdferror)
    CALL check("H5Aexists_async_f",hdferror, total_error)

    f_ptr = C_LOC(exists1)
    CALL H5Aexists_async_f(file_id, TRIM(attr_name)//"00", f_ptr, es_id, hdferror)
    CALL check("H5Aexists_async_f",hdferror, total_error)

    f_ptr = C_LOC(exists2)
    CALL H5Aexists_by_name_async_f(file_id, "/", attr_name, f_ptr, es_id, hdferror)
    CALL check("H5Aexists_by_name_async_f",hdferror, total_error)

    f_ptr = C_LOC(exists3)
    CALL H5Aexists_by_name_async_f(file_id, "/", TRIM(attr_name)//"00", f_ptr, es_id, hdferror)
    CALL check("H5Aexists_by_name_async_f",hdferror, total_error)

    CALL H5Aopen_async_f(file_id, attr_name, attr_id0, es_id, hdferror)  ! <--- fails here
    CALL check("H5Aopen_async_f", hdferror, total_error)


async_test: ../../src/H5Fint.c:631: H5F__get_objects_cb: Assertion `obj_ptr' failed.
async_test: ../../src/H5Fint.c:631: H5F__get_objects_cb: Assertion `obj_ptr' failed.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7f5c5f7734e2 in ???
#1  0x7f5c5f772675 in ???
#2  0x7f5c5e280d4f in ???
#3  0x7f5c5e280cbb in ???
#4  0x7f5c5e282354 in ???
#5  0x7f5c5e278cb9 in ???
#6  0x7f5c5e278d41 in ???
#7  0x7f5c60813ec7 in H5F__get_objects_cb
        at ../../src/H5Fint.c:631
#8  0x7f5c608e3555 in H5I__iterate_cb
        at ../../src/H5Iint.c:1526
#9  0x7f5c608e4eb2 in H5I_iterate
        at ../../src/H5Iint.c:1592
#10  0x7f5c60813dc0 in H5F__get_objects
        at ../../src/H5Fint.c:599
#11  0x7f5c608173a0 in H5F_get_obj_count
        at ../../src/H5Fint.c:475
#12  0x7f5c60920b98 in H5O__attr_find_opened_attr
        at ../../src/H5Oattribute.c:661
#13  0x7f5c60921f31 in H5O__attr_open_by_name
        at ../../src/H5Oattribute.c:473
#14  0x7f5c606fcacc in H5A__open
        at ../../src/H5Aint.c:535
#15  0x7f5c60b04368 in H5VL__native_attr_open
        at ../../src/H5VLnative_attr.c:154
#16  0x7f5c60ae073d in H5VL__attr_open
        at ../../src/H5VLcallback.c:1104
#17  0x7f5c60ae8827 in H5VLattr_open
        at ../../src/H5VLcallback.c:1175
#18  0x7f5c60d8527b in async_attr_open_fn
        at /home/brtnfld/work/vol-async/src/h5_async_vol.c:5675
#19  0x7f5c5c1bbc97 in ???
#20  0x7f5c5c1c1e98 in ???
#21  0xffffffffffffffff in ???

@brtnfld
Collaborator Author

brtnfld commented Mar 1, 2023

The branch is https://github.com/brtnfld/hdf5/tree/ASYNC_F

To run the test:

#!/bin/bash
export ABT_DIR=$HOME/work/argobots/build/argobots/
export HDF5_DIR=$HOME/work/hdf5.brtnfld/build/hdf5

export LD_LIBRARY_PATH="$HDF5_DIR/lib64:$HOME/packages/szip-2.1.1/szip/lib64:$ABT_DIR/lib64:$LD_LIBRARY_PATH"
export HDF5_PLUGIN_PATH="$HOME/work/vol-async/build/lib"
export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}"

mpiexec -n 6 ./async_test

@brtnfld
Collaborator Author

brtnfld commented Mar 1, 2023

I'm also seeing periodic hangs with 8 ranks, but that is probably a separate issue:

#0  0x00007f5178db5890 in pool_pop_shared () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#1  0x00007f5178db9aea in sched_run () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#2  0x00007f5178da59b9 in thread_main_sched_func () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#3  0x00007f5178db3c98 in ABTD_ythread_func_wrapper () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#4  0x00007f5178da5469 in ABTD_ythread_context_func_wrapper () from /home/brtnfld/work/argobots/build/argobots//lib64/libabt.so.1
#5  0x0000000000000000 in ?? ()
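
When scripting the test, one way to tell a hang like this apart from an ordinary failure is to run it under coreutils `timeout`, which exits with status 124 when the time limit expires. A minimal sketch follows; the helper name `run_with_limit` and the 300-second limit in the usage line are illustrative assumptions, not part of the test suite:

```shell
#!/bin/bash
# run_with_limit SECONDS CMD...  (hypothetical helper, not part of HDF5 or vol-async)
# Runs CMD under coreutils `timeout` and reports a timeout (status 124) as a
# possible hang. The command's own exit status is passed through otherwise.
run_with_limit() {
    local limit=$1
    shift
    local status=0
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s (possible hang)" >&2
    fi
    return "$status"
}
```

Usage would be something like `run_with_limit 300 mpiexec -n 8 ./async_test`. How cleanly the individual ranks are torn down after the timeout depends on the MPI launcher.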

@houjun
Collaborator

houjun commented Mar 1, 2023

@brtnfld, can you add your full test code file here?

@brtnfld
Collaborator Author

brtnfld commented Mar 1, 2023

async.F90.gz

It is also here:
https://github.com/brtnfld/hdf5/blob/ASYNC_F/fortran/testpar/async.F90

Line 252 is the issue.

@houjun
Collaborator

houjun commented Mar 1, 2023

Got it. Is there a C version of this test code?

@brtnfld
Collaborator Author

brtnfld commented Mar 1, 2023

No, only Fortran.

@houjun
Collaborator

houjun commented Mar 2, 2023

@brtnfld I'm able to reproduce the error.
After some debugging, this appears to be an old issue that I thought had already been resolved in HDF5. Either it has recurred or my earlier testing did not cover this case well.

Basically, the issue comes from HDF5 checking whether an attribute is already opened in H5Oattribute.c. It does not seem to like the future IDs used by the async VOL when some attributes have already been created/opened and some have not. I found two workarounds that avoid this error:

  1. Comment out lines 473-479 and 512 of H5Oattribute.c. This way HDF5 won't check for already-opened attributes, and the error does not occur.
  2. Run "export HDF5_ASYNC_EXE_FCLOSE=1" before running the test program. This forces the async VOL not to start executing I/O operations until ESwait or Fclose is called, so the attribute IDs remain true future IDs that have not yet been filled in by the async VOL.
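
Applied to the run script posted earlier in this thread, workaround 2 might look like the sketch below. The paths are the ones from that script. The check for `./async_test` is an added assumption so that the snippet does nothing harmful when run outside the test directory:

```shell
#!/bin/bash
# Same VOL settings as the run script earlier in the thread.
export HDF5_PLUGIN_PATH="$HOME/work/vol-async/build/lib"
export HDF5_VOL_CONNECTOR="async under_vol=0;under_info={}"

# Workaround 2: defer execution of async operations until ESwait or Fclose.
export HDF5_ASYNC_EXE_FCLOSE=1

# Launch only if the test binary is present in the current directory
# (this guard is an assumption added for the sketch).
if [ -x ./async_test ]; then
    mpiexec -n 6 ./async_test
else
    echo "async_test not found in $(pwd); environment set, nothing launched" >&2
fi
```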

I forget whether it was Neil or Jordan who looked at this issue before. Can you check with them and see if there is a better solution?

Also, the test code seems to always segfault at the end:

nid00074:testpar$ srun -n 6 ./async_test
H5ES API tests                                                                          PASSED
H5A async API tests                                                                     PASSED
srun: error: nid00074: tasks 0-4: Segmentation fault
srun: launch/slurm: _step_signal: Terminating StepId=520944.42
srun: error: nid00074: task 5: Segmentation fault

@brtnfld
Collaborator Author

brtnfld commented Mar 2, 2023

Thanks, I'll ask Jordan and Neil. I've not seen that segmentation fault before, though I've only run it on a local desktop.

@brtnfld
Collaborator Author

brtnfld commented Mar 2, 2023

BTW, even if I add an ESwait after the last exists, it still fails.

@houjun
Collaborator

houjun commented Mar 6, 2023

@brtnfld does setting the environment variable work for you?
I don't think adding an ESwait would help; the issue seems to come from HDF5 checking the cached attribute.

@brtnfld
Collaborator Author

brtnfld commented Mar 6, 2023

Yes, HDF5_ASYNC_EXE_FCLOSE fixes the issue.

@fortnern
Member

fortnern commented Mar 7, 2023

@houjun could you share more details of your debugging? Looking through the future ID code, I'm having trouble understanding how this could happen.

@houjun
Collaborator

houjun commented Mar 8, 2023

Hi @fortnern, I tried two things in my debugging that seem to fix this issue. The first is to comment out the code in the HDF5 library (lines 473-479 and 512 of H5Oattribute.c) so that HDF5 doesn't check whether an attribute is already opened. The second is in vol-async: I can delay the execution of all the attribute operations to a later time (e.g. at file close time).
My guess is that something goes wrong when the library checks its cached attributes: either it doesn't like the future ID or there is some interference from vol-async. The interference seems unlikely, though, since only one thread can perform HDF5 operations at a time when thread safety is turned on.

@houjun
Collaborator

houjun commented May 25, 2023

@brtnfld @fortnern, can you check if the latest develop branch fixes all the Fortran test issues?

@brtnfld
Collaborator Author

brtnfld commented May 25, 2023

It passes most of the time, but running it over and over, I can sometimes get it to fail with:


async_test: ../../src/H5Fint.c:631: H5F__get_objects_cb: Assertion `obj_ptr' failed.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7f48829964e2 in ???
#1  0x7f4882995675 in ???
#2  0x7f4880518cff in ???
#3  0x7f4880518c6b in ???
#4  0x7f488051a304 in ???
#5  0x7f4880510c69 in ???
#6  0x7f4880510cf1 in ???
#7  0x7f4883466b0a in H5F__get_objects_cb
        at ../../src/H5Fint.c:631
#8  0x7f4883536331 in H5I__iterate_cb
        at ../../src/H5Iint.c:1526
#9  0x7f4883537c5a in H5I_iterate
        at ../../src/H5Iint.c:1592
#10  0x7f4883466a03 in H5F__get_objects
        at ../../src/H5Fint.c:599
#11  0x7f4883469ff4 in H5F_get_obj_count
        at ../../src/H5Fint.c:475
#12  0x7f4883573ffd in H5O__attr_find_opened_attr
        at ../../src/H5Oattribute.c:661
#13  0x7f488357539b in H5O__attr_open_by_name
        at ../../src/H5Oattribute.c:473
#14  0x7f488334df3f in H5A__open
        at ../../src/H5Aint.c:535
#15  0x7f488375e753 in H5VL__native_attr_open
        at ../../src/H5VLnative_attr.c:158
#16  0x7f488373aaac in H5VL__attr_open
        at ../../src/H5VLcallback.c:1104
#17  0x7f4883742b96 in H5VLattr_open
        at ../../src/H5VLcallback.c:1175
#18  0x7f47f20c2737 in async_attr_open_fn
        at /home/brtnfld/work/vol-async/src/h5_async_vol.c:5772
#19  0x7f47f209bc97 in ???
#20  0x7f47f20a1e78 in ???
#21  0xffffffffffffffff in ???
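
Since the failure is intermittent, a rough way to estimate how often it occurs is to run the test repeatedly and count the runs that exit with a non-zero status. A minimal sketch follows; `count_failures` is a hypothetical helper, and it assumes the launcher returns non-zero when a rank aborts:

```shell
#!/bin/bash
# count_failures N CMD...  (hypothetical helper)
# Runs CMD N times, discarding its output, and prints how many runs
# exited with a non-zero status.
count_failures() {
    local runs=$1
    shift
    local failures=0
    local i
    for ((i = 1; i <= runs; i++)); do
        if ! "$@" > /dev/null 2>&1; then
            failures=$((failures + 1))
        fi
    done
    echo "$failures"
}
```

For example, `count_failures 50 mpiexec -n 6 ./async_test` would print the number of failing runs out of 50.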

@houjun
Collaborator

houjun commented May 30, 2023

@brtnfld I think this is probably the same issue I mentioned earlier with the opened attribute. Did you set "export HDF5_ASYNC_EXE_FCLOSE=1"?
In my previous debugging, the issue seemed to come from searching the cached attributes in the library. My guess is that the (filled) future ID is not handled properly by the library. I'll see if I can find out more this week.

@brtnfld
Collaborator Author

brtnfld commented May 30, 2023

That was my mistake. It got removed when I edited the run script. With it set, all the tests pass.
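
A small guard near the top of the run script could make this kind of accidental removal fail loudly instead of showing up as an intermittent abort. This is only a sketch; `require_async_workaround` is a hypothetical name:

```shell
#!/bin/bash
# require_async_workaround  (hypothetical helper)
# Returns non-zero, with a message on stderr, unless HDF5_ASYNC_EXE_FCLOSE=1
# is set in the environment.
require_async_workaround() {
    if [ "${HDF5_ASYNC_EXE_FCLOSE:-}" != "1" ]; then
        echo "HDF5_ASYNC_EXE_FCLOSE=1 is not set; H5Aopen_async_f may abort intermittently" >&2
        return 1
    fi
    return 0
}
```

Calling `require_async_workaround || exit 1` before the `mpiexec` line would stop the script when the variable is missing.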


3 participants