[Memory access related errors] Error in `./bin/xspecfem3D': double free or corruption (!prev) #1674

Open
planetarianPKU opened this issue Jan 24, 2024 · 4 comments

@planetarianPKU

Dear SPECFEM3D Team,

A note up front:

I think the error is probably caused by my cluster rather than the code, because I have used SPECFEM3D on this cluster without problems for 3 years and recently compiled my code without problems on another cluster. But since it is so bizarre, I am recording it here. I also don't understand why it runs normally in DEBUG mode.

I have been using SPECFEM3D on large-scale clusters for 3 years and am familiar with the xspecfem3D forward simulation code. Recently I needed to recompile my 2018 version of SPECFEM3D. To my surprise, the recompiled xspecfem3D reported memory errors like the ones below. This almost never happened before, at least not with code I had not changed myself. After changing the mesh division for a test run, I got:

*** Error in `./bin/xspecfem3D': free(): invalid next size (normal): 0x0000000001ed9380 ***
*** Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x0000000002576ae0 ***
======= Backtrace: =========
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2b3dc2198299]
./bin/xspecfem3D[0x6a2c60]
./bin/xspecfem3D[0x640609]
./bin/xspecfem3D[0x63f4c1]
./bin/xspecfem3D[0x5bad4c]
./bin/xspecfem3D[0x5ca605]
/lib64/libc.so.6(+0x81299)[0x2b01c5475299]
./bin/xspecfem3D[0x6a2c60]
./bin/xspecfem3D[0x640609]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x63f4c1]
./bin/xspecfem3D[0x5bad4c]
./bin/xspecfem3D[0x5ca605]
./bin/xspecfem3D[0x406062]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3dc2139555]
./bin/xspecfem3D[0x405f69]

or like this:

*** Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000014cbcc0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2ab901558299]
./bin/xspecfem3D[0x69edd0]
./bin/xspecfem3D[0x63c779]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]
./bin/xspecfem3D[0x5c9890]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab9014f9555]
./bin/xspecfem3D[0x405f69]
======= Memory map: ========
00400000-007c3000 r-xp 00000000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009c2000-009c4000 r--p 003c2000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009c4000-009f1000 rw-p 003c4000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009f1000-00a93000 rw-p 00000000 00:00 0
0122a000-014e0000 rw-p 00000000 00:00 0 [heap]
2ab8ff26b000-2ab8ff28d000 r-xp 00000000 fd:00 1834 /usr/lib64/ld-2.17.so
2ab8ff28d000-2ab8ff297000 rw-p 00000000 00:00 0
2ab8ff297000-2ab8ff298000 rw-s 003f0000 00:05 43018 /dev/infiniband/uverbs4
2ab8ff298000-2ab8ff299000 rw-s 003f0000 00:05 43015 /dev/infiniband/uverbs1
2ab8ff299000-2ab8ff29a000 rw-s 003f0000 00:05 43016 /dev/infiniband/uverbs2

or like this:

xspecfem3D:73154 terminated with signal 11 at PC=63bdc9 SP=7ffdad54a840. Backtrace:

xspecfem3D:73152 terminated with signal 11 at PC=63bdc9 SP=7fff54347740. Backtrace:
*** Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000012ec4c0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2ab904fcc299]
./bin/xspecfem3D[0x69edd0]
./bin/xspecfem3D[0x63c779]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]
./bin/xspecfem3D[0x5c9890]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab904f6d555]
./bin/xspecfem3D[0x405f69]
======= Memory map: ========
./bin/xspecfem3D[0x63bdc9]
./bin/xspecfem3D[0x63c8c2]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]
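
The frames in backtraces like these are raw addresses inside the executable. They can usually be mapped back to routine names with the binutils tool `addr2line`. The sketch below assumes the binary is not stripped and is not position-independent (consistent with the `00400000` load address in the memory map above). `job.log` is a hypothetical file holding the crash output, and the function names are made up for illustration. File and line numbers would only appear if the build included `-g`.

```shell
# Print the unique xspecfem3D frame addresses found in a crash log,
# in order of first appearance. libc frames are skipped on purpose.
extract_frames() {
  grep -o 'xspecfem3D\[0x[0-9a-f]*\]' "$1" \
    | sed 's/.*\[\(0x[0-9a-f]*\)\]/\1/' \
    | awk '!seen[$0]++'
}

# Resolve those addresses against the executable given as $2.
# -f prints function names, -p puts each frame on one line.
resolve_frames() {
  extract_frames "$1" | xargs addr2line -f -p -e "$2"
}

# Hypothetical usage (job.log is an assumed file name):
#   resolve_frames job.log ./bin/xspecfem3D
```

A frame that repeats many times in a row, such as `0x6406c3` above, would be expected to resolve to a recursive routine.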

To find out why this happens, I first tested the 2018 and 2023 versions of the source code; the results were similar. Then I added many print statements before and after each call in /src/specfem3D/xspecfem3D.f90 to monitor where it crashed.
Finally I found that the processes always crashed at:

specfem3D.F90: call setup_sources_receivers()
setup_sources_receivers.F90: call setup_search_kdtree()
/shared/search_kdtree.f90: call create_kdtree(npoints,points_data,points_index,kdtree, &
depth,1,npoints,numnodes,maxdepth)

where create_kdtree is a recursively defined subroutine:

recursive subroutine create_kdtree(npoints,points_data,points_index,node, &
depth,ibound_lower,ibound_upper,numnodes,maxdepth)

At this point I was very confused, because even the unchanged code reported these errors. I thought the environment of my cluster might have changed over these 3 years. After struggling in vain, I made one final attempt: adding DEBUG flags when compiling:

configure:
./configure FC=/home/opt/intel2020u4/compilers_and_libraries_2020.4.304/linux/bin/intel64/ifort CC=icc MPIFC=mpiifort --with-mpi MPI_INC=/home/opt/intel2020u4/compilers_and_libraries_2020.4.304/linux/mpi/intel64/include
in Makefile:
DEBUG_COUPLED_FLAG = -check all -debug -g -fp-stack-check -traceback -ftrapuv -xHost -assume byterecl -assume buffered_io -mcmodel=medium -shared-intel

The program miraculously returned to normal, and I still don't know why. I then quickly changed some code to output the snapshots I wanted, and it worked. I think the error is probably caused by my cluster rather than the code, because I have used SPECFEM3D on this cluster without problems for 3 years and recently compiled my code without problems on another cluster. I also don't understand why it runs normally in DEBUG mode.

Now I happily continue to use SPECFEM3D, in DEBUG mode. I am writing this to share my recent experience with SPECFEM3D on my cluster.

Regards

Jingnan Sun

@danielpeter
Contributor

danielpeter commented Jan 24, 2024 via email

@planetarianPKU
Author

Dear Doc. Danielpeter:

Thank you for your very useful advice!

Following your advice, I compiled and ran the devel version you updated yesterday, and it works perfectly with no errors.

Then I recompiled my 2018 version again without the DEBUG flags, and the errors appeared again. I also checked my sbatch script; I did launch everything with mpirun:

mpirun -np $NPROC ./bin/xmeshfem3D
mpirun -np $NPROC ./bin/xgenerate_databases
mpirun -np $NPROC ./bin/xspecfem3D

There are MPI files from proc000000_XX to proc000003_XX in DATABASES_MPI, and I added many myrank prints in specfem3d.f90 to monitor the progress of each process. So I'm sure I ran the program under MPI.
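
A quick way to confirm that the database files cover every rank is to count the distinct `procNNNNNN` prefixes and compare the result with `$NPROC`. This is only a sketch, assuming files are named `procNNNNNN_<something>` as described above. The directory is passed as an argument because its location depends on the setup.

```shell
# Count the distinct MPI rank prefixes (procNNNNNN) among the files in $1.
count_ranks() {
  for f in "$1"/proc[0-9][0-9][0-9][0-9][0-9][0-9]_*; do
    [ -e "$f" ] || continue          # glob matched nothing
    f=${f##*/}                       # strip the directory part
    printf '%s\n' "${f%%_*}"         # keep only the procNNNNNN prefix
  done | sort -u | wc -l | tr -d ' '
}

# Hypothetical usage in a job script:
#   [ "$(count_ranks DATABASES_MPI)" -eq "$NPROC" ] || echo "rank count mismatch"
```

With files from proc000000_XX to proc000003_XX, the count would be 4.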

Anyway, I will use the 2018 version in DEBUG mode and may move my work to the new devel version in the future.

Jingnan

@danielpeter
Contributor

no Doc please, just daniel... - we can do a doctor's like surgical operation described below if you like :)

and thanks for the feedback. there can indeed be a problem in the old search_kdtree.f90 code for Intel compilers. as stated in the source code file:

  ! note: compiling with intel ifort version 18.0.1/19.1.0 and optimizations like -xHost -O2 or -xHost -O3 flags
  !       can lead to issues with the deallocate(workindex) statement below:
  !         *** Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000024f1610 ***
  !
  !       this might be due to a more aggressive optimization which leads to a change of the instruction set
  !       and the memory being free twice.
  !       a way to avoid this is by removing -xHost from FLAGS_CHECK = .. in Makefile
  !       or to use a pointer array instead of an allocatable array
  !
  ! integer,dimension(:),allocatable :: workindex

it seems you're using an Intel compiler, so as it says, you can either try

  • to compile without the -xHost flag or
  • do a surgical operation and replace your old version file src/shared/search_kdtree.f90 with the new one from the recent devel branch where this has been fixed
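
A minimal sketch of the first option, assuming the flags sit on a single `FLAGS_CHECK = ...` line in the generated Makefile, as in the lines quoted in this thread. The function name and the `Makefile.noxhost` file name are hypothetical.

```shell
# Print the Makefile given as $1 with -xHost removed from the FLAGS_CHECK line only.
# Assumes FLAGS_CHECK is defined on one line that starts at column one.
strip_xhost() {
  sed '/^FLAGS_CHECK[[:space:]]*=/ s/[[:space:]]*-xHost//' "$1"
}

# Hypothetical usage: review the change before applying it, then rebuild as usual.
#   strip_xhost Makefile > Makefile.noxhost
#   diff Makefile Makefile.noxhost
#   mv Makefile.noxhost Makefile
```

Only the `FLAGS_CHECK` line is touched, so any other use of `-xHost` in the file stays as it is. Re-running `./configure` would likely regenerate the Makefile and undo the edit.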

happy coding :)

@planetarianPKU
Author

Dear daniel:

Following your advice, I copied search_kdtree.f90 from the 2024 devel version into the 2018 master version, kept the -xHost flag when compiling, and it works.

FLAGS_CHECK = -xHost -fpe0 -ftz -assume buffered_io -assume byterecl -align sequence -std03 -diag-disable 6477 -implicitnone -gen-interfaces -warn all -O3 -check nobounds

Then I compiled the unchanged 2018 version without the -xHost flag, and it works too.

FLAGS_CHECK = -fpe0 -ftz -assume buffered_io -assume byterecl -align sequence -std03 -diag-disable 6477 -implicitnone -gen-interfaces -warn all -O3 -check nobounds

Code built with either method is equally fast, running in one-fifth of the time needed in DEBUG mode, and I have checked that the snapshots and receiver waveforms are correct. This saves a great deal of time and resolves a confusion that had been bothering me for a long time (there is no problem with the source code, and there seems to be no problem with my tiny modifications, so why do I get an error? Wait, why does the unmodified source code also report an error? When did I change it?). Anyway, the problem is fully solved now.

You are very kind, and I truly appreciate your help.

Jingnan
