[Memory access related errors] Error in `./bin/xspecfem3D': double free or corruption (!prev) #1674
first, if you can, try to see if the devel branch version works - maybe this has been fixed already.
also, something I noticed recently is that the updated underlying MPI libraries crash when the code is compiled with MPI support (`--with-mpi`) but then run as a serial executable, `./bin/xspecfem3D`, for single-process (NPROC = 1) simulations. if that is also your case, you will need to run it wrapped in the `mpirun` launcher, like
```
mpirun -np 1 ./bin/xspecfem3D
```
and the same for the other executables like `xmeshfem3D` and `xgenerate_databases`
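A minimal sketch of that advice for a batch script. It assumes the run reads `NPROC` from a line of the form `NPROC = 4` in `DATA/Par_file`; the helper names `get_nproc` and `launch_cmd` are hypothetical and not part of SPECFEM3D. The actual launch loop is left commented out so the sketch can be checked without an MPI installation:

```shell
#!/usr/bin/env bash
# Sketch only: the Par_file layout ("NPROC = 4") and the helper names are
# assumptions for illustration, not part of SPECFEM3D itself.

# Print the value of the first "NPROC = <n>" entry in the given Par_file.
get_nproc() {
  awk -F'=' '/^[ \t]*NPROC[ \t]*=/ { split($2, v, " "); print v[1]; exit }' "$1"
}

# Print (for logging) the MPI launch line for one executable;
# even NPROC = 1 goes through mpirun instead of ./bin/<exe> directly.
launch_cmd() {
  local nproc="$1" exe="$2"
  echo "mpirun -np ${nproc} ./bin/${exe}"
}

# Example use in a job script (commented out so the sketch runs without MPI):
# NPROC=$(get_nproc DATA/Par_file)
# for exe in xmeshfem3D xgenerate_databases xspecfem3D; do
#   echo "$(launch_cmd "$NPROC" "$exe")"
#   mpirun -np "$NPROC" ./bin/"$exe"
# done
```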
Dear Doc. Danielpeter: Thank you for your very useful advice! Following it, I compiled and ran the devel version you updated yesterday, and it works perfectly with no errors, which is good. Then I recompiled my 2018 version of the code again without the DEBUG flags, and it shows the errors again. I checked my sbatch script, and I did run with mpirun: `mpirun -np $NPROC ./bin/xmeshfem3D`. There are MPI files from proc000000_XX to proc000003_XX in DATABASES_MPI, and I added a lot of myrank prints in specfem3d.f90 to monitor the progress of each process. So I'm sure I did run the MPI program. Anyway, I will use the 2018 version in DEBUG mode and maybe move my work to the new devel version in the future. Jingnan
no Doc please, just daniel... - we can do a doctor-like surgical operation described below if you like :) and thanks for the feedback. there can indeed be a problem in the old search_kdtree.f90 code for Intel compilers. as stated in the source code file:
it seems you're using an Intel compiler, so as it says, you can either try
happy coding :)
Dear daniel: Following your advice, I copied search_kdtree.f90 from the 2024 devel version into the 2018 master version, kept the -xHost flag when compiling, and it works.
Then I compiled the unchanged 2018 version without the -xHost flag, and it works too.
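Since the crash went away either with the newer `search_kdtree.f90` or simply by dropping `-xHost`, the second workaround amounts to stripping one option from the Intel compiler flags and rebuilding. A minimal sketch, assuming the flags are handled as a plain space-separated string (for example, copied from the flags line of the generated Makefile); `remove_flag` is a hypothetical helper, not part of SPECFEM3D:

```shell
#!/usr/bin/env bash
# Sketch only: remove_flag is a hypothetical helper, not part of SPECFEM3D.
# Print the flag string with every exact occurrence of one option removed.
remove_flag() {
  local flags="$1" flag="$2" out="" tok
  local -a toks
  read -ra toks <<< "$flags"          # split on whitespace without globbing
  for tok in "${toks[@]}"; do
    if [ "$tok" != "$flag" ]; then
      out="${out:+$out }$tok"         # append with a single separating space
    fi
  done
  printf '%s\n' "$out"
}

# Example: remove_flag "-O3 -xHost -assume byterecl" "-xHost"
```

However the Makefile is edited, a clean rebuild (e.g. `make clean` followed by `make`) is needed so that no object file compiled with `-xHost` is left behind.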
You are great, and I truly appreciate your help. Jingnan
Dear SPECFEM3D Team,
A note up front:
I think the error may be caused by my cluster rather than by the code, because I have been using SPECFEM3D normally on this cluster for 3 years, and I recently compiled my code without problems on another cluster. But since this is so bizarre, I'll record it here. I also don't understand why it runs normally in DEBUG mode.
I have been using SPECFEM3D on large-scale clusters for 3 years and am familiar with the xspecfem3D forward simulation code. Recently, for various reasons, I needed to recompile my 2018 version of SPECFEM3D. I was surprised to find that, after recompiling, the xspecfem3D program reported memory errors like the following. This had almost never happened before, at least not with code I had left unchanged. I changed the mesh partitioning, tried running again, and got:
*** Error in `./bin/xspecfem3D': free(): invalid next size (normal): 0x0000000001ed9380 ***
*** Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x0000000002576ae0 ***
======= Backtrace: =========
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2b3dc2198299]
./bin/xspecfem3D[0x6a2c60]
./bin/xspecfem3D[0x640609]
./bin/xspecfem3D[0x63f4c1]
./bin/xspecfem3D[0x5bad4c]
./bin/xspecfem3D[0x5ca605]
/lib64/libc.so.6(+0x81299)[0x2b01c5475299]
./bin/xspecfem3D[0x6a2c60]
./bin/xspecfem3D[0x640609]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x63f4c1]
./bin/xspecfem3D[0x5bad4c]
./bin/xspecfem3D[0x5ca605]
./bin/xspecfem3D[0x406062]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3dc2139555]
./bin/xspecfem3D[0x405f69]
or like this:
*** Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000014cbcc0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2ab901558299]
./bin/xspecfem3D[0x69edd0]
./bin/xspecfem3D[0x63c779]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]
./bin/xspecfem3D[0x5c9890]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab9014f9555]
./bin/xspecfem3D[0x405f69]
======= Memory map: ========
00400000-007c3000 r-xp 00000000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009c2000-009c4000 r--p 003c2000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009c4000-009f1000 rw-p 003c4000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009f1000-00a93000 rw-p 00000000 00:00 0
0122a000-014e0000 rw-p 00000000 00:00 0 [heap]
2ab8ff26b000-2ab8ff28d000 r-xp 00000000 fd:00 1834 /usr/lib64/ld-2.17.so
2ab8ff28d000-2ab8ff297000 rw-p 00000000 00:00 0
2ab8ff297000-2ab8ff298000 rw-s 003f0000 00:05 43018 /dev/infiniband/uverbs4
2ab8ff298000-2ab8ff299000 rw-s 003f0000 00:05 43015 /dev/infiniband/uverbs1
2ab8ff299000-2ab8ff29a000 rw-s 003f0000 00:05 43016 /dev/infiniband/uverbs2
or like this:
xspecfem3D:73154 terminated with signal 11 at PC=63bdc9 SP=7ffdad54a840. Backtrace:
xspecfem3D:73152 terminated with signal 11 at PC=63bdc9 SP=7fff54347740. Backtrace:
*** Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000012ec4c0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2ab904fcc299]
./bin/xspecfem3D[0x69edd0]
./bin/xspecfem3D[0x63c779]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]
./bin/xspecfem3D[0x5c9890]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab904f6d555]
./bin/xspecfem3D[0x405f69]
======= Memory map: ========
./bin/xspecfem3D[0x63bdc9]
./bin/xspecfem3D[0x63c8c2]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]
To find out why this happens, I first tested both the 2018 and the 2023 versions of the source code, with similar results. Then I added a lot of print statements before and after each function call in /src/specfem3D/xspecfem3D.f90 to monitor where it crashed.
And finally I found that the program on every process always crashed at
`specfem3D.F90`: `call setup_sources_receivers()`.
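When several MPI processes write their debug prints to one interleaved log, the print-statement approach above is easier to read if only the last checkpoint each rank reached is kept. A minimal sketch, assuming a hypothetical print format `rank <N>: <message>` (the actual prints added to the source are not shown in this thread) and a hypothetical helper name `last_checkpoint_per_rank`:

```shell
#!/usr/bin/env bash
# Sketch only: assumes each debug print looks like "rank <N>: <message>",
# a hypothetical format; adapt the pattern to the actual print statements.

# Report the last checkpoint message seen for each rank, sorted by rank.
last_checkpoint_per_rank() {
  awk '/^rank [0-9]+: / {
         r = $2; sub(/:$/, "", r)                 # "3:" -> "3"
         msg = $0; sub(/^rank [0-9]+: /, "", msg)
         last[r] = msg                            # later lines overwrite earlier ones
       }
       END { for (r in last) print r ": " last[r] }' "$1" | sort -n
}
```

For example, running it on a saved job log (such as the SLURM output file) would show, for each rank, whether it got past `setup_sources_receivers`.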
At this point I was very confused, because even the unchanged code reported these errors. I thought this might be because the environment of my cluster had changed over these 3 years. After struggling in vain, I made one final attempt, which was to add debug flags when compiling:
configure:
./configure FC=/home/opt/intel2020u4/compilers_and_libraries_2020.4.304/linux/bin/intel64/ifort CC=icc MPIFC=mpiifort --with-mpi MPI_INC=/home/opt/intel2020u4/compilers_and_libraries_2020.4.304/linux/mpi/intel64/include
in Makefile:
DEBUG_COUPLED_FLAG = -check all -debug -g -fp-stack-check -traceback -ftrapuv -xHost -assume byterecl -assume buffered_io -mcmodel=medium -shared-intel
and the program miraculously returned to normal, and I still don't know why. Then I quickly changed some code to output the snapshots I wanted, and it works.
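One way to narrow down which part of the debug configuration hides the crash is to list the options that appear only in the debug flag set and re-add them one at a time. One possible explanation, not confirmed in this thread: with the Intel Fortran compiler, `-g` without an explicit optimization level makes `-O0` the default, so the debug build also changes how the code is optimized. A minimal sketch; `flags_only_in` is a hypothetical helper, and the release flag set used in the example comment is an assumption:

```shell
#!/usr/bin/env bash
# Sketch only: flags_only_in is a hypothetical helper. Options are compared
# token by token, so a two-word option such as "-check all" shows up as
# "-check" and "all".

# Print the tokens of flag set $1 that do not appear in flag set $2.
flags_only_in() {
  local a="$1" b="$2" out="" tok
  local -a toks
  read -ra toks <<< "$a"              # split on whitespace without globbing
  for tok in "${toks[@]}"; do
    case " $b " in
      *" $tok "*) ;;                  # present in both sets: skip
      *) out="${out:+$out }$tok" ;;   # only in the first set: keep
    esac
  done
  printf '%s\n' "$out"
}

# Example (release set "-O3 -xHost" is hypothetical):
# flags_only_in "-check all -debug -g -traceback -xHost" "-O3 -xHost"
```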
Now I continue to happily use SPECFEM3D, in DEBUG mode. I am writing this to share my recent experience using SPECFEM3D on my cluster.
Regards
Jingnan Sun