Reproducibility of LAMMPS run with DP potential #3270
Comments
There is no way of reproducing long MD trajectories due to the chaotic nature of many-body dynamical systems.
Thanks for the reply, but the deviation starts only after 10 ps. I do not think this should be expected...? The truncation error discussed in #1656 makes more sense to me. I am wondering if there is a way to improve this?
I would not expect consistency beyond 1000 time steps.
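As an illustration of why even round-off-level differences wash out agreement after some number of steps, here is a small sketch (a generic chaotic map, not a DeePMD calculation) showing a machine-precision perturbation growing to order one:

```python
# Two trajectories of a chaotic map that differ only at machine precision.
# The gap grows roughly exponentially, so after enough steps the trajectories
# are completely uncorrelated even though the update rule is deterministic.
x, y = 0.4, 0.4 + 1e-16          # identical except for the last bit
for step in range(1, 101):
    x = 3.9 * x * (1.0 - x)      # logistic map as a stand-in for the MD update
    y = 3.9 * y * (1.0 - y)
    if step % 20 == 0:
        print(f"step {step:3d}  |x - y| = {abs(x - y):.3e}")
```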
Hi Wanghan, Could you please provide further details on this? Feel free to correct me if I'm mistaken. I anticipate that a potential model, once trained, should be deterministic in its inference step, similar to a trained neural network model. Thus, would you consider this potential model (specifically, a deepmd trained potential) to be stochastic? If so, could you explain how it operates in that manner? Thanks, Ariana
It would actually be interesting to see how non-deterministic deepmd's use of TensorFlow is and what this means for an MD trajectory.
Our customized CUDA OP also uses non-deterministic atomic operations. A deterministic implementation may need extra effort, which might not be worth doing.
Yep--already found this and we have been working on re-coding it. |
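For context on the non-determinism from atomic accumulation mentioned above, here is a small sketch (plain NumPy, not the actual CUDA kernel): floating-point addition is not associative, so accumulating the same force contributions in a different order, as happens when GPU threads race to add into the same output, can change the low bits of the result from run to run.

```python
import numpy as np

# Sum the same per-neighbor contributions in two different orders.
# With float32, the two results generally differ in the last bits
# because floating-point addition is not associative.
rng = np.random.default_rng(0)
contributions = rng.standard_normal(100_000).astype(np.float32)

in_order = np.float32(0.0)
for c in contributions:
    in_order += c

shuffled = np.float32(0.0)
for c in rng.permutation(contributions):
    shuffled += c

print(in_order, shuffled, in_order - shuffled)
```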
I would like to measure how much the custom CUDA kernel contributes, and how much any TF ops contribute. Is there a way to use the GPU for TF but disable the CUDA prod_force.cu kernel? It seems DP_VARIANT is all or nothing from my tests.

The other thing I am hung up on is trying to print model weights from the frozen model. I can't seem to get any of the tf1 compat methods to do it, I guess because the model was built with the TF1 compat API on top of TF2. I would love to compare the model weights over multiple "identical" runs of training.
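For the frozen-model weights, a minimal sketch, assuming the frozen model is a standard TF1-style GraphDef protobuf (the file name frozen_model.pb is illustrative, not a DeePMD-specific API), is to walk the graph and dump every Const node:

```python
# Sketch: list the constant tensors (which include the network weights)
# baked into a frozen TF1-style GraphDef.
import tensorflow as tf
from tensorflow.python.framework import tensor_util

graph_def = tf.compat.v1.GraphDef()
with open("frozen_model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

for node in graph_def.node:
    if node.op == "Const":
        value = tensor_util.MakeNdarray(node.attr["value"].tensor)
        print(node.name, value.dtype, value.shape)
```

Comparing the dumped arrays from two "identical" training runs (e.g. with `np.max(np.abs(a - b))`) would show directly whether the weights themselves diverge.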
By the way, there seems to be another source of "non-determinism" in the deepmd code that may actually be a bug: I ran the same training twice and the learning-rate values came out different. Digging in, I see the learning rate is scheduled with a tf1 module. This shouldn't be a parallel op, so I wouldn't think it picks up the kind of non-determinism that comes from CUDA atomics in the other ops. Maybe it's an uninitialized variable, or some sort of rounding instability? But this causes dramatic differences in reproducibility of training on identical data with identical settings/hyperparameters and stack.
deepmd-kit/source/op/prod_force_grad_multi_device.cc, lines 275-276 (commit 91049df)
This is a bit complex, but the
Do you change the number of training steps? The learning rate depends on it.
deepmd-kit/deepmd/tf/utils/learning_rate.py, lines 89-91 (commit 91049df)
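As a rough sketch of why the total step count matters (the names and defaults below are illustrative, not the exact DeePMD variables): with an exponential-decay schedule, the decay rate is derived from the total number of training steps, so changing the total changes the learning rate at every step.

```python
# Illustrative exponential-decay schedule: decay_rate is chosen so the
# learning rate goes from start_lr to roughly stop_lr over stop_steps.
def exp_decay_lr(step, start_lr=1e-3, stop_lr=1e-8, decay_steps=5000, stop_steps=1_000_000):
    decay_rate = (stop_lr / start_lr) ** (decay_steps / stop_steps)
    return start_lr * decay_rate ** (step // decay_steps)

# Same step, different total training lengths -> different learning rates.
print(exp_decay_lr(100_000, stop_steps=1_000_000))
print(exp_decay_lr(100_000, stop_steps=200_000))
```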
Learning rate: the number of steps was the same; nothing was different except that I ran it again, I am pretty sure. I will run some more tests to verify. What I am trying to do is enable TF to run on the GPU but disable all the local deepmd CUDA kernels (non-TF). I guess I can go in and comment those all out and then build with GPU to get TF on the device. Will check out the model checking options, thanks...
So I've done some reproducibility testing just on model training and inference. I ran the exact same training on the same data, same hyperparameters, twice to get two "identical" models, on two different DFT datasets, and repeated this for several numbers of training steps. Then I ran inference with both models and compared the predictions. I have some baffling results.

When I look at the maximum absolute difference in predicted force components (x, y, z) for one system (120 atoms, 110,000 training frames), the variations between "identical" training runs are pretty huge. Some atoms' predicted force components across the two "identical" trainings can differ by as much as 1 eV/Å, and the difference increases with the number of training steps: around 0.2 eV/Å for 100K training steps, 0.4 eV/Å for 200K training steps, and over 1 eV/Å for 1M training steps. These numbers were confirmed on a different machine running the deepmd-kit container.

For the other system, 623 atoms and ~60K training frames, the maximum absolute difference is much lower: about 1.3e-11 eV/Å for 20K steps and about 1e-10 eV/Å for 100K training steps (this system takes longer to train, so I am still collecting data for longer training times). That is a HUGE difference in non-deterministic variation between these systems.

The other thing that is troubling is that for both systems, changing the random seed leads to a max abs difference in predicted force components of around 0.4 eV/Å. I am sort of wondering if there is some bug in the code or the test module, because none of this makes any sense, especially the massive max differences for the one smaller system. It would be good to run more tests on other datasets. I found a few things online. Tests were all run with a pip install of DeePMD-kit on an x86 AMD EPYC CPU + NVIDIA A100 GPU with Ubuntu.
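For reference, this kind of comparison can be done with a few lines of NumPy; the file names and the assumption that the last three columns hold the predicted fx, fy, fz are illustrative and depend on how the test output was written:

```python
# Sketch: maximum absolute difference in predicted force components
# between two "identical" training runs. Column layout is an assumption;
# adjust the slice to match your output files.
import numpy as np

forces_run1 = np.loadtxt("run1_forces.out")[:, -3:]
forces_run2 = np.loadtxt("run2_forces.out")[:, -3:]

diff = np.abs(forces_run1 - forces_run2)
print(f"max |dF| = {diff.max():.3e} eV/Å, mean |dF| = {diff.mean():.3e} eV/Å")
```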
Do you get the same behavior with the CPU, or only with the GPU? Printing lcurve.out in smaller increments might help show where the runs start to diverge. Please note that, according to the TF documentation, some ops are non-deterministic when run on the GPU.
Yes, there should be some non-determinism with TF. But I didn't expect it to affect the forces THAT much. That's a lot. And it seems strange that it would affect one system so much and not the other. I will run some tests with CPU-only training and also with the CUDA kernels turned off. Good idea about printing lcurve in smaller increments; I will also try this.
I'm also wondering what it would take to turn on TF determinism in DeePMD. Some detailed notes on doing this can be found here: We are working with Duncan/NVIDIA, so we can ask questions. I am just not sure what to do with the tf1 compat API on top of the TF2 package. It seems to fall through the cracks. If I were to add the determinism setting, I'm not sure where it should go or whether it would take effect through the compat layer.
For the random seed: we don't use any global random seed. Instead, the seed is passed from the input file, as in:
deepmd-kit/deepmd/utils/network.py, line 43 (commit b875ea8)
For determinism with tf.compat.v1: I don't know; I have never used it. The most helpful resource should be tensorflow/community#346.
Yes, that is what I am talking about. Where in the code would be the top-most entrypoint to add this command so it propagates down to all the TF calls? Or maybe it needs to go in multiple places?
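For reference, a minimal sketch of what the determinism switch looks like in plain TensorFlow (not a DeePMD-specific API; whether it reaches the tf.compat.v1 code paths and the custom CUDA ops is exactly the open question here):

```python
import os
import tensorflow as tf

# Older TF releases: environment variable, set before any ops execute.
os.environ["TF_DETERMINISTIC_OPS"] = "1"

# TF 2.8+ exposes an explicit call; it should run at program start,
# before any ops are created.
if hasattr(tf.config.experimental, "enable_op_determinism"):
    tf.config.experimental.enable_op_determinism()
```

Note that this only covers stock TF kernels; the custom deepmd CUDA ops would still need their own deterministic implementation, as discussed above.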
It is possible to obtain the same model parameters with deepmd, provided a few conditions are met.
The interested reader can try out the open PR combined with these three variables to be added to their scripts.
We successfully ran the training and inference tests more than twice and got the same result every time.
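One quick, admittedly crude way to verify that kind of run-to-run agreement (paths are illustrative) is to hash the frozen models from the two runs; a weight-by-weight comparison as sketched earlier is more robust if the files embed any metadata:

```python
# Sketch: check whether two training runs produced byte-identical frozen models.
import hashlib

def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

print(sha256_of("run1/frozen_model.pb") == sha256_of("run2/frozen_model.pb"))
```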
Summary
Hello,
I’m currently attempting to replicate an NVT simulation. I’ve set the seed for the initial velocity and confirmed that the initial velocities are consistent. I am also using the same machine and the same DeePMD-kit version to run the simulation (single processor). However, I’ve noticed that after a certain number of time steps, the positions and velocities start to deviate. I checked the previous cases #1656 and #2270, and it seems this issue comes from truncation error. I wonder if there is a way to improve the precision to avoid this from happening? I would like a deterministic simulation that I can reproduce with exactly the same results from the same inputs. Thanks!
DeePMD-kit Version
v2.2.1
TensorFlow Version
Python Version, CUDA Version, GCC Version, LAMMPS Version, etc
No response