dynamically load op library in C++ interface #1384
Conversation
The C++ interface will dynamically load OP libraries, just like the Python interface, so it no longer needs linking.
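A minimal sketch of the dynamic-loading approach described above (this is not the actual DeePMD-kit code; the function name and error handling are illustrative):

```cpp
// Sketch: load a TensorFlow op library at runtime with dlopen, the same
// way the Python interface does via tf.load_op_library. Illustrative,
// not the real DeePMD-kit API.
#include <dlfcn.h>

#include <stdexcept>
#include <string>

void load_op_library(const std::string &lib_name) {
  // RTLD_NOW resolves all symbols immediately, so a missing dependency
  // fails fast; RTLD_GLOBAL makes the op/kernel registration symbols
  // visible to TensorFlow.
  void *handle = dlopen(lib_name.c_str(), RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) {
    throw std::runtime_error(std::string("dlopen failed: ") + dlerror());
  }
  // Intentionally keep the handle open: the registered ops must remain
  // available for the lifetime of the process.
}
```

Because the name passed to dlopen contains no slash, the dynamic loader searches LD_LIBRARY_PATH and the RPATH/RUNPATH of the calling object, which is why the RPATH questions in this thread matter.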
Codecov Report
@@           Coverage Diff            @@
##            devel    #1384    +/-  ##
========================================
- Coverage   75.53%   64.28%  -11.26%
========================================
  Files          91        5      -86
  Lines        7505       14    -7491
========================================
- Hits         5669        9    -5660
+ Misses       1836        5    -1831
========================================
Continue to review full report at Codecov.
Force-pushed from e6e296a to 12bcccb.
It is not necessary anymore.
@denghuilu please check if it works.
An error occurred during the MD process. Not sure what's going on; I'll check it this afternoon.
@njzjz after adding $deepmd_root/lib to LD_LIBRARY_PATH, the MD process goes well.
@denghuilu Can you check if RPATH is set?
I think rpath should have already been set here: deepmd-kit/source/lmp/env.sh.in Line 11 in b88c1da
I have no idea what's going on. The devel branch works fine.
compiler error
Force-pushed from aea3d5b to a3f8d95.
I'll take another look at it.
@denghuilu I rechecked 0f61527 by downloading and compiling a new LAMMPS. However, I found no problem running it without setting it.
@denghuilu Can you test the following command?
(base) [jz748@localhost lmp]$ readelf -d /home/jz748/codes/deepmd-kit/dp/lib/libdeepmd_cc.so | head -20
As you see,
Here's the output:
Checking LAMMPS?
@denghuilu set the following environment variable before running LAMMPS:
export LD_DEBUG=libs
It will give the following information:
We can see the search path it tries.
$deepmd_root/lib is not within the search path.
fabdac9 should fix it.
The same error...
This reverts commit fabdac9.
Ok, I'll take a look...
@denghuilu Finally I reproduced it by adding it. Under this situation, the linker will add
See https://stackoverflow.com/a/43703445/9567349 and https://stackoverflow.com/a/52020177/9567349. Adding
Ok, I give up finding other solutions... Adding
Nothing changed after setting the
@njzjz Here's my environment:
This flag is added by OpenMPI, see open-mpi/ompi#1089
If mpicxx adds the flag, I don't think we can override it, though.
This reverts commit ecccb57.
In de00e04, I call dlopen in our own library instead of using TF's function. @denghuilu I think it will also work with
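A sketch of why calling dlopen from our own library helps (names below are illustrative, not the real API; see the two Stack Overflow answers linked above): glibc's dynamic loader resolves a bare library name against the DT_RUNPATH of the shared object that *calls* dlopen. TensorFlow's libraries have no $deepmd_root/lib in their RUNPATH, but libdeepmd_cc.so can be built with one we control, such as $ORIGIN.

```cpp
// Sketch (illustrative names): the dlopen call site determines whose
// RUNPATH the dynamic loader searches. By placing the call inside our
// own libdeepmd_cc.so, a RUNPATH baked into that library at build time
// is enough to locate the op library, without the user having to set
// LD_LIBRARY_PATH.
#include <dlfcn.h>

#include <cstdio>

// Imagine this function compiled into libdeepmd_cc.so: the loader
// resolves a bare name against *this* object's DT_RUNPATH, not TF's.
extern "C" void *deepmd_load_op_library(const char *name) {
  void *handle = dlopen(name, RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
  }
  return handle;
}
```

The RUNPATH actually recorded in the built library can be checked with readelf -d, as earlier in this thread.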
It's my mistake. After recompiling DeePMD-kit, everything works fine.
root lmp $ git log | head
commit de00e04206b93bf87e9c4b64a097266455ccb015
Author: Jinzhe Zeng <[email protected]>
Date: Wed Jan 19 04:52:07 2022 -0500
dlopen from dp lib but not TF
commit 3362f99b014259d25b981ad6fa04fe26e5ed3873
Author: Jinzhe Zeng <[email protected]>
Date: Wed Jan 19 04:14:55 2022 -0500
root lmp $ echo $LD_LIBRARY_PATH
/root/denghui/openmpi-4.0.6/lib:/usr/local/cuda-11.0/lib64:/root/denghui/openmpi-4.0.6/lib:/usr/local/cuda-11.0/lib64:/root/denghui/openmpi-4.0.6/lib:/usr/local/cuda-11.0/lib64:
root lmp $ mpirun --allow-run-as-root -n 1 /root/denghui/lammps/src/lmp_mpi < in.lammps
LAMMPS (29 Sep 2021)
Reading data file ...
triclinic box = (0.0000000 0.0000000 0.0000000) to (12.444700 12.444700 12.444700) with tilt (0.0000000 0.0000000 0.0000000)
1 by 1 by 1 MPI processor grid
reading atoms ...
192 atoms
read_data CPU = 0.001 seconds
Summary of lammps deepmd module ...
>>> Info of deepmd-kit:
installed to: /root/denghui/deepmd_root
source: v2.0.2-66-gde00e04-dirty
source branch: dynamically-load-op-library
source commit: de00e04
source commit at: 2022-01-19 04:52:07 -0500
surpport model ver.:1.1
build float prec: double
build with tf inc: /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
build with tf lib: /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
set tf intra_op_parallelism_threads: 0
set tf inter_op_parallelism_threads: 0
>>> Info of lammps module:
use deepmd-kit at: /root/denghui/deepmd_root
source: v2.0.2-66-gde00e04-dirty
source branch: dynamically-load-op-library
source commit: de00e04
source commit at: 2022-01-19 04:52:07 -0500
build float prec: double
build with tf inc: /root/denghui/tensorflow_root/include;/root/denghui/tensorflow_root/include
build with tf lib: /root/denghui/tensorflow_root/lib/libtensorflow_cc.so;/root/denghui/tensorflow_root/lib/libtensorflow_framework.so
2022-01-21 09:20:11.789581: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-21 09:20:11.789987: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-21 09:20:11.801557: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-21 09:20:11.802645: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-21 09:20:12.474975: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-21 09:20:12.476106: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-21 09:20:12.477162: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-01-21 09:20:12.478219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 31006 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:00:08.0, compute capability: 7.0
>>> Info of model(s):
using 1 model(s): frozen_model.pb
rcut in model: 6
ntypes in model: 2
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Your simulation uses code contributions which should be cited:
- USER-DEEPMD package:
The log file lists these citations in BibTeX format.
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Neighbor list info ...
update every 10 steps, delay 0 steps, check no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 8
ghost atom cutoff = 8
binsize = 4, bins = 4 4 4
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair deepmd, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0005
Per MPI rank memory allocation (min/avg/max) = 3.908 | 3.908 | 3.908 Mbytes
Step PotEng KinEng TotEng Temp Press Volume
0 -29944.158 8.1472669 -29936.011 330 37078.187 1927.3176
100 -29943.989 7.9877789 -29936.001 323.54004 27603.467 1927.3176
200 -29943.349 7.3418604 -29936.007 297.37751 32879.887 1927.3176
300 -29944.262 8.2516105 -29936.011 334.22637 27118.163 1927.3176
400 -29944.503 8.4884408 -29936.014 343.81903 26527.481 1927.3176
500 -29944.535 8.514281 -29936.021 344.86568 40825.342 1927.3176
600 -29944.479 8.4484458 -29936.031 342.19906 26730.448 1927.3176
700 -29944.57 8.5090059 -29936.061 344.65201 27365.977 1927.3176
800 -29943.903 7.8286542 -29936.074 317.09479 34878.898 1927.3176
900 -29944.711 8.6057383 -29936.106 348.5701 34243.605 1927.3176
1000 -29944.493 8.3574289 -29936.136 338.51248 34715.817 1927.3176
Loop time of 4.74255 on 1 procs for 1000 steps with 192 atoms
Performance: 9.109 ns/day, 2.635 hours/ns, 210.857 timesteps/s
97.5% CPU use with 1 MPI tasks x no OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 4.576 | 4.576 | 4.576 | 0.0 | 96.49
Neigh | 0.13852 | 0.13852 | 0.13852 | 0.0 | 2.92
Comm | 0.015699 | 0.015699 | 0.015699 | 0.0 | 0.33
Output | 0.0030058 | 0.0030058 | 0.0030058 | 0.0 | 0.06
Modify | 0.0065539 | 0.0065539 | 0.0065539 | 0.0 | 0.14
Other | | 0.002799 | | | 0.06
Nlocal: 192.000 ave 192 max 192 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 2066.00 ave 2066 max 2066 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 0.00000 ave 0 max 0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs: 40898.0 ave 40898 max 40898 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Total # of neighbors = 40898
Ave neighs/atom = 213.01042
Neighbor list builds = 100
Dangerous builds not checked
Total wall time: 0:00:07
The library type was changed from SHARED to MODULE in deepmodeling#1384. Fixes errors in conda-forge/deepmd-kit-feedstock#31
In this PR, the C++ interface will dynamically load OP libraries, just like the Python interface, so it no longer needs linking. Thus, I also removed the CMAKE_LINK_WHAT_YOU_USE flag. Note that this requires RPATH to be set (which we have already done). Refer to: https://discuss.tensorflow.org/t/how-to-load-custom-op-from-c/5748