[Memory access related errors] Error in `./bin/xspecfem3D': double free or corruption (!prev) #1674

planetarianPKU · 2024-01-24T10:57:59Z

Dear SPECFEM3D Team,

A note up front:

I think the error may be caused by my cluster rather than the code, because I have been using SPECFEM3D normally on this cluster for 3 years and recently compiled my code without problems on another cluster. But since this is so bizarre, I'll record it here. I also don't understand why it runs normally in DEBUG mode.

I have been using SPECFEM3D on large-scale clusters for 3 years and am familiar with the xspecfem3D forward simulation code. Recently, I needed to recompile my 2018 version of SPECFEM3D. I was surprised to find that the recompiled xspecfem3D program reported memory errors like the ones below. This almost never happened before, at least not when I had left the code unchanged. I changed the mesh partitioning to try another run and got:

*** Error in `./bin/xspecfem3D': free(): invalid next size (normal): 0x0000000001ed9380 ***
*** Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x0000000002576ae0 ***
======= Backtrace: =========
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2b3dc2198299]
./bin/xspecfem3D[0x6a2c60]
./bin/xspecfem3D[0x640609]
./bin/xspecfem3D[0x63f4c1]
./bin/xspecfem3D[0x5bad4c]
./bin/xspecfem3D[0x5ca605]
/lib64/libc.so.6(+0x81299)[0x2b01c5475299]
./bin/xspecfem3D[0x6a2c60]
./bin/xspecfem3D[0x640609]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x6406c3]
./bin/xspecfem3D[0x63f4c1]
./bin/xspecfem3D[0x5bad4c]
./bin/xspecfem3D[0x5ca605]
./bin/xspecfem3D[0x406062]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3dc2139555]
./bin/xspecfem3D[0x405f69]

or like this:

*** Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000014cbcc0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2ab901558299]
./bin/xspecfem3D[0x69edd0]
./bin/xspecfem3D[0x63c779]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]
./bin/xspecfem3D[0x5c9890]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab9014f9555]
./bin/xspecfem3D[0x405f69]
======= Memory map: ========
00400000-007c3000 r-xp 00000000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009c2000-009c4000 r--p 003c2000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009c4000-009f1000 rw-p 003c4000 00:28 12876224895012393020 /home/sunjn/software/graduation/2024/origin/specfem3d-master/EXAMPLES/meshfem3D_examples/test_C002/bin/xspecfem3D
009f1000-00a93000 rw-p 00000000 00:00 0
0122a000-014e0000 rw-p 00000000 00:00 0 [heap]
2ab8ff26b000-2ab8ff28d000 r-xp 00000000 fd:00 1834 /usr/lib64/ld-2.17.so
2ab8ff28d000-2ab8ff297000 rw-p 00000000 00:00 0
2ab8ff297000-2ab8ff298000 rw-s 003f0000 00:05 43018 /dev/infiniband/uverbs4
2ab8ff298000-2ab8ff299000 rw-s 003f0000 00:05 43015 /dev/infiniband/uverbs1
2ab8ff299000-2ab8ff29a000 rw-s 003f0000 00:05 43016 /dev/infiniband/uverbs2

or like this:

xspecfem3D:73154 terminated with signal 11 at PC=63bdc9 SP=7ffdad54a840. Backtrace:

xspecfem3D:73152 terminated with signal 11 at PC=63bdc9 SP=7fff54347740. Backtrace:
*** Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000012ec4c0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2ab904fcc299]
./bin/xspecfem3D[0x69edd0]
./bin/xspecfem3D[0x63c779]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]
./bin/xspecfem3D[0x5c9890]
./bin/xspecfem3D[0x406062]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ab904f6d555]
./bin/xspecfem3D[0x405f69]
======= Memory map: ========
./bin/xspecfem3D[0x63bdc9]
./bin/xspecfem3D[0x63c8c2]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63c833]
./bin/xspecfem3D[0x63b631]
./bin/xspecfem3D[0x5ba6bc]

To find out why this happens, I first tested both the 2018 and 2023 versions of the source code; the results were similar. Then I added many print statements before and after each call in /src/specfem3D/xspecfem3D.f90 to monitor where it crashed.
Finally I found that the processes always crashed at

`specfem3D.F90`: `call setup_sources_receivers()`

`setup_sources_receivers.F90`: `call setup_search_kdtree()`

`/shared/search_kdtree.f90`: `call create_kdtree(npoints,points_data,points_index,kdtree,depth,1,npoints,numnodes,maxdepth)`

where `create_kdtree` is a recursively defined subroutine:

`recursive subroutine create_kdtree(npoints,points_data,points_index,node,depth,ibound_lower,ibound_upper,numnodes,maxdepth)`
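As an alternative to inserting print statements, the raw addresses in the glibc backtraces above can usually be mapped to function names with `addr2line` from GNU binutils. The sketch below assumes the executable still carries symbol or debug information (ideally built with `-g`) and that the backtrace was saved to a file. The helper names `extract_addrs` and `resolve_backtrace` and the file name `backtrace.log` are illustrative only.

```shell
#!/bin/sh
# Sketch: turn glibc backtrace lines such as
#   ./bin/xspecfem3D[0x6a2c60]
# into function names and source lines.

# Print the hex addresses of all frames that belong to xspecfem3D.
extract_addrs() {
    # $1: file containing the saved backtrace
    sed -n 's/^.*xspecfem3D\[\(0x[0-9a-fA-F]*\)\]$/\1/p' "$1"
}

# Resolve each address with addr2line (-f also prints the function name).
resolve_backtrace() {
    # $1: executable, $2: file containing the saved backtrace
    extract_addrs "$2" | while read -r addr; do
        addr2line -f -e "$1" "$addr"
    done
}

# Example usage (hypothetical log file name):
# resolve_backtrace ./bin/xspecfem3D backtrace.log
```

Without debug information, `addr2line` typically prints `??` for the source location, so this is most useful on a build that adds `-g` while keeping the optimization flags that trigger the crash.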

At this point I was very confused, because even the unmodified code reported these errors. I thought the cause might be that my cluster's environment had changed over these 3 years. After struggling in vain, I made one final attempt: adding debug flags when compiling:

configure:
./configure FC=/home/opt/intel2020u4/compilers_and_libraries_2020.4.304/linux/bin/intel64/ifort CC=icc MPIFC=mpiifort --with-mpi MPI_INC=/home/opt/intel2020u4/compilers_and_libraries_2020.4.304/linux/mpi/intel64/include
in Makefile:
DEBUG_COUPLED_FLAG = -check all -debug -g -fp-stack-check -traceback -ftrapuv -xHost -assume byterecl -assume buffered_io -mcmodel=medium -shared-intel

With these flags, the program miraculously returned to normal, and I still don't know why. I then quickly changed some code to output the snapshots I wanted, and it works. I think the error may be caused by my cluster rather than the code, because I have been using SPECFEM3D normally on this cluster for 3 years and recently compiled my code without problems on another cluster. I also don't understand why it runs normally in DEBUG mode.

Now I continue to happily use SPECFEM3D, in DEBUG mode. I am writing this to share my recent experience using SPECFEM3D on my cluster.

Regards

Jingnan Sun


danielpeter · 2024-01-24T14:24:31Z

first, if you can, try to see if the devel branch version works - maybe this has been fixed already. also, something I noticed recently is that the underlying updated MPI libraries crash when the code is compiled with MPI support (`--with-mpi`), but then run as a serial executable `./bin/xspecfem3D` for single NPROC==1 simulations. if that is also your case, you will need to run it with the `mpirun` launcher around it, like

```
mpirun -np 1 ./bin/xspecfem3D
```

and same for the other executables like `xmeshfem3D` and `xgenerate_databases`
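A minimal sketch of that launch pattern, which always goes through the MPI launcher even when `NPROC` is 1. The `run_stage` helper and the `LAUNCHER` variable are hypothetical names, not part of SPECFEM3D; the executable paths are the ones used in this thread.

```shell
#!/bin/sh
# Sketch: always start the SPECFEM3D executables through the MPI launcher,
# even for NPROC=1. LAUNCHER and run_stage are illustrative names only.
LAUNCHER="${LAUNCHER:-mpirun}"
NPROC="${NPROC:-1}"

run_stage() {
    # $1: path to the executable, e.g. ./bin/xspecfem3D
    "$LAUNCHER" -np "$NPROC" "$1"
}

# Example usage (runs the three stages in order, stopping at the first failure):
# run_stage ./bin/xmeshfem3D &&
#     run_stage ./bin/xgenerate_databases &&
#     run_stage ./bin/xspecfem3D
```

Keeping the launcher in a variable makes it easy to swap in a different launcher on clusters that require one, assuming it accepts the same `-np` option.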

planetarianPKU · 2024-01-25T04:33:22Z

Dear Doc. Danielpeter:

Thank you for your very useful advice!

Following your advice, I compiled and ran the devel version you updated yesterday, and it works perfectly with no errors.

Then I recompiled my 2018 version again without the DEBUG flags, and it showed the errors again. I also checked my sbatch script, and I did run with mpirun:

mpirun -np $NPROC ./bin/xmeshfem3D
mpirun -np $NPROC ./bin/xgenerate_databases
mpirun -np $NPROC ./bin/xspecfem3D

and there are MPI files from proc000000_XX to proc000003_XX in DATABASES_MPI. I also added many myrank prints in specfem3d.f90 to monitor the progress of each process. So I'm sure I did run the MPI program.

Anyway, I will use the 2018 version in DEBUG mode and maybe move my work to the new devel version in the future.

Jingnan

danielpeter · 2024-01-25T11:50:09Z

no Doc please, just daniel... - we can do a doctor's like surgical operation described below if you like :)

and thanks for the feedback. there can indeed be a problem in the old search_kdtree.f90 code for Intel compilers. as stated in the source code file:

  ! note: compiling with intel ifort version 18.0.1/19.1.0 and optimizations like -xHost -O2 or -xHost -O3 flags
  !       can lead to issues with the deallocate(workindex) statement below:
  !         *** Error in `./bin/xspecfem3D': double free or corruption (!prev): 0x00000000024f1610 ***
  !
  !       this might be due to a more aggressive optimization which leads to a change of the instruction set
  !       and the memory being free twice.
  !       a way to avoid this is by removing -xHost from FLAGS_CHECK = .. in Makefile
  !       or to use a pointer array instead of an allocatable array
  !
  ! integer,dimension(:),allocatable :: workindex

it seems you're using an Intel compiler, so as it says, you can either try

- to compile without the `-xHost` flag, or
- do a surgical operation and replace your old version file `src/shared/search_kdtree.f90` with the new one from the recent devel branch where this has been fixed
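For the first option, a minimal sketch of stripping `-xHost` from the build flags. It assumes the flags sit on a single `FLAGS_CHECK = ...` line in the generated Makefile, as quoted in this thread; `strip_xhost` is a hypothetical helper, not a SPECFEM3D tool. Rerunning `./configure` may regenerate the Makefile and undo the change.

```shell
#!/bin/sh
# Sketch: remove the -xHost token from the FLAGS_CHECK line of a Makefile.
# Other lines are left untouched; a backup is kept as <Makefile>.bak.
strip_xhost() {
    # $1: path to the Makefile to edit
    makefile="$1"
    cp "$makefile" "$makefile.bak"
    # First expression handles -xHost followed by other flags,
    # second handles -xHost as the last flag on the line.
    sed -e '/^FLAGS_CHECK[[:space:]]*=/s/[[:space:]]-xHost[[:space:]]/ /' \
        -e '/^FLAGS_CHECK[[:space:]]*=/s/[[:space:]]-xHost$//' \
        "$makefile.bak" > "$makefile"
}

# Example usage, followed by a clean rebuild of the executables:
# strip_xhost Makefile
```

Writing from the backup into the original path avoids `sed -i`, whose syntax differs between GNU and BSD sed.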

happy coding :)

planetarianPKU · 2024-01-26T03:08:16Z

Dear daniel:

Following your advice, I copied search_kdtree.f90 from the 2024 devel version into the 2018 master version, kept the -xHost flag when compiling, and it works.

FLAGS_CHECK = -xHost -fpe0 -ftz -assume buffered_io -assume byterecl -align sequence -std03 -diag-disable 6477 -implicitnone -gen-interfaces -warn all -O3 -check nobounds

Then I compiled the unchanged 2018 version without the -xHost flag, and it works too.

FLAGS_CHECK = -fpe0 -ftz -assume buffered_io -assume byterecl -align sequence -std03 -diag-disable 6477 -implicitnone -gen-interfaces -warn all -O3 -check nobounds

Codes from both methods are equally fast, running in one-fifth of the time needed in DEBUG mode, and the snapshots and receiver waveforms are correct after checking. This saves a lot of time and resolves a confusion that had been bothering me for a long time (there is no problem with the source code, and there seems to be no problem with my tiny modifications, so why do I get an error? Wait, why is the unmodified source code also reporting an error? When did I change it?). Anyway, the problem is fully solved now.

you are very gorgeous and I truly appreciate your help.

Jingnan
