Maximizing GPU utilization #84
I am not sure about recent updates, but in the past I have always improved performance massively by switching CPU processing off completely. I was also only able to do that when running the console version.
there are a lot of ways to tune GPU utilization, but it depends a little on what exactly you are running. i typically monitor nvidia-smi in the same way you did. in general, we have the following control options to tune performance with Prismatic (all of the options can be seen in the command-line help):
if you have enough memory on each GPU to hold the entire S-matrix (for PRISM) or the potential array (for multislice), the first option i would change is to make sure those arrays are kept resident on the device, i.e. a single up-front transfer instead of streaming pieces as needed. after that, the settings to look at are num-streams and batch size.

num-streams controls the parallelization within a single device (the GPU equivalent of num-threads for the CPU) by creating new workers. this costs extra GPU memory (scaling linearly with the number of streams), and in my experience it stops helping after about 4-5 streams, probably because of context switching, but also because memory copies between host and device are slow compared to some of the compute steps.

batch size (for both CPU and GPU) controls how many probes a worker thread or stream evaluates before asking for more memory to be transferred from the host. if your propagation is really fast this could be limiting, but in the past i've mostly adjusted it to make simulations fit in GPU ram.

all of this discussion is within the context of a single frozen phonon. if you need to run a lot of frozen phonons and have multiple GPU devices, it's probably faster to run each simulation on a single device and average the results in post. this is most easily achieved by setting the number/ID of visible CUDA devices through environment variables; for pyprismatic, the easiest workaround is probably to use a parallel library like joblib and write a wrapper function that spawns the pyprismatic sim processes under different environments.
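A rough sketch of that joblib workaround, where run_single_fp.py is a hypothetical helper script that builds a pyprismatic.Metadata for one frozen-phonon configuration and calls go(); the essential parts are just launching a fresh subprocess per run and setting CUDA_VISIBLE_DEVICES before it starts:

```python
import os
import subprocess
import sys
from joblib import Parallel, delayed

# hypothetical runner script: takes an output filename and a random seed,
# sets up a pyprismatic.Metadata for one frozen-phonon configuration,
# and calls meta.go()
RUNNER = "run_single_fp.py"

def run_on_gpu(gpu_id, fp_index):
    # each run gets a fresh process that can only see one GPU
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    out = f"fp_{fp_index:03d}.h5"
    subprocess.run([sys.executable, RUNNER, out, str(fp_index)],
                   env=env, check=True)
    return out

# e.g. 16 frozen-phonon configurations spread round-robin over 4 GPUs;
# the per-run outputs are then averaged in post-processing
outputs = Parallel(n_jobs=4)(
    delayed(run_on_gpu)(i % 4, i) for i in range(16)
)
```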
that's a big dump of info-- I hope it helps clarify some of the options involved in configuring the simulation for your hardware, though most of this is automatically configured at the start of the simulation to generally decent settings anyway.
@thomasaarholt, your snapshot shows that all GPUs are being used at 30%; is that the case most of the time, or just at this specific moment? If the former, this sounds fairly surprising, but without any details on the simulation parameters it is impossible to figure out what the cause could be... I haven't run many simulations saving the 4D output, but I noticed that those simulations were significantly slower (5-10x), and a quick profiling showed that the GPUs were idle most of the time, which is not the case at all without the 4D output... @lerandc, could it be that there is a lock somewhere which is not released when writing the 4D output and would therefore hold up the computation? I should have a minimal example reproducing this issue; I can find it in my data if that is useful.
I believe the 30% is referring to the GPU fan speed...? The utilization is reading as 0%.
Indeed! 🤦♂️, I guess I have been spoiled by using https://github.com/wookayin/gpustat 😅
That should be very quick to check and confirm.
Confirmed the 4D slowdown. Model: n10_Prismatic_RT.zip. This slowdown also reduces my GPU utilization to 0. This explains the slowdown observed in my first post, and reduces my confusion significantly. Thanks @ericpre for introducing me to gpustat.
ah yes, the 4D write is definitely extremely slow, for a couple of reasons.

currently, prismatic never holds the full 4D output in memory at any single point in time. this is mostly because the 4D arrays can be huge, of course! if we were to run a simulation with as many probes as you have there, with a final output resolution of 256x256, the array would occupy almost 40 gigabytes in ram--probably much more than everything else needed in the simulation! instead, what we do is save the 4D output for each probe as soon as the propagation finishes. back before v1.2, this was just a massive dump of .mrc files for each frozen phonon, which, while extremely cumbersome (especially considering FP averaging would be a post-processing step), was pretty fast.

when we moved to HDF5, we kept the same idea of saving at the end of each calculation-- though now we have a shared output resource that all calculation threads must access, so for a 4D simulation the write process (as of v1.2.1, slightly different in my development branch) runs through a fixed sequence of steps per probe (controlled by a thread).
all of steps 3 to 7 are accomplished with CPU resources (either a worker thread or the host thread for a GPU stream), and all of steps 4 to 7 are mutex locked so that only a single worker thread can access the HDF5 file at once. that is to say, only a single CBED is ever written to disk at a time, serially across all worker threads. steps 2, 3, 5, and 6 are fast in all scenarios.

the desired behavior is that when each thread finishes its calculation, you have some sort of race for the HDF5 resource, assuming all threads get there at about the same time. each thread must wait its turn, so the threads become offset from each other in real time with respect to calculation progress, and by the time they next try to output data they no longer have to wait for access. however, in a case like your experiment, where (judging by the impressive 15 sec run time for the 3D case) the propagation per probe seems much faster than the file access, the threads end up spending most of their time waiting on the output lock instead of computing.

I think this could probably be "easily" solved with parallel HDF5-- after all, the probes are inherently thread-safe operations since they operate on separate parts of real space, so the file IO could be truly parallel. I haven't been able to invest time into figuring it out, though. the documentation for it is also hard to parse, and there is not much (recent) discussion of implementing such features on either the HDF5 forums or forums like stack overflow. I'm pretty sure h5py supports parallel IO, so it might be worth investigating their implementation at some point.
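For reference, a minimal sketch of what parallel IO through h5py looks like, assuming an MPI-enabled build of h5py plus mpi4py; this is not what prismatic does today, just an illustration that each writer can own a disjoint slice of the probe axis with no lock:

```python
# run with: mpiexec -n 4 python parallel_write_sketch.py
import h5py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_probes, det_y, det_x = 1024, 256, 256

# every rank opens the same file collectively with the MPI-IO driver
with h5py.File("cbed_parallel.h5", "w", driver="mpio", comm=comm) as f:
    # dataset creation is collective: all ranks call it with the same arguments
    dset = f.create_dataset("datacube", shape=(n_probes, det_y, det_x),
                            dtype="float32")
    # each rank writes only its own disjoint set of probe indices,
    # so no lock is needed around the writes
    for p in range(rank, n_probes, size):
        cbed = np.zeros((det_y, det_x), dtype="float32")  # stand-in for a real CBED
        dset[p] = cbed
```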
Thanks @lerandc for the information. Even if the dataset is very large, it is surprising that the bottleneck would be reading/writing the dataset, because in @thomasaarholt's case 8 min sounds far too long! This could be checked easily by cropping the 4D dataset at 0.1 mrad, for example, so that it is still simulating the 4D dataset but the size of the dataset is very small.
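The same overhead-versus-size question can also be probed outside prismatic with a few lines of h5py, timing per-probe writes of a tiny CBED versus a full-size one (a standalone sketch with made-up sizes, separate from the calls discussed in the following comments; it keeps the file open throughout, so it only measures the per-call HDF5 cost, not any reopen overhead):

```python
import time
import h5py
import numpy as np

def time_per_probe_writes(det_size, n_probes=2000):
    """Write one CBED at a time, mimicking a per-probe output loop."""
    cbed = np.random.rand(det_size, det_size).astype("float32")
    with h5py.File("overhead_test.h5", "w") as f:
        dset = f.create_dataset("datacube",
                                shape=(n_probes, det_size, det_size),
                                dtype="float32")
        t0 = time.perf_counter()
        for p in range(n_probes):
            dset[p] = cbed
        return (time.perf_counter() - t0) / n_probes

# if per-call overhead dominates, the two timings will be similar;
# if data volume dominates, the 256x256 case will be much slower
print("2x2 CBED:     %.3f ms/probe" % (1e3 * time_per_probe_writes(2)))
print("256x256 CBED: %.3f ms/probe" % (1e3 * time_per_probe_writes(256)))
```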
Yep (take a look at the calls below to make sure I interpreted correctly what you asked me to do).
I haven't done a deep profile of how much time is spent where, but I suspect it's not so dependent in this case on the size of the array as on the overhead of opening, accessing, and selecting components of the HDF5 file stream. the steps involved in the 1.2.1 implementation are, roughly: acquire the write lock, then open and select the relevant pieces of the HDF5 file, write the data, and release the write lock.

to be honest, I don't remember why all of these steps are necessary-- I forget the testing that went on when I implemented this last year. edit: or even if all of them are.

the results @thomasaarholt just dropped in seem to support that it is overhead limited in this scenario.
the model above is mostly used in my dev branch too, but it can perhaps be improved upon a good bit by keeping the dataset and memory spaces themselves as persistent objects in the parameter class, instead of just the file stream. this is formally "less safe" but probably safe enough for the calculation conditions we have.
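A rough h5py sketch of that difference (names here are made up, and prismatic itself is C++; the point is only reopen-per-write versus keeping persistent handles alive for the whole simulation):

```python
import h5py

def write_reopening(path, probe_index, cbed):
    # per-probe pattern: pay the file open/lookup/close cost on every write
    with h5py.File(path, "r+") as f:
        f["datacube"][probe_index] = cbed

class PersistentWriter:
    """Keep the file and dataset handles alive, so each write is only a
    hyperslab selection plus the actual IO."""

    def __init__(self, path):
        self.f = h5py.File(path, "r+")
        self.dset = self.f["datacube"]

    def write(self, probe_index, cbed):
        self.dset[probe_index] = cbed

    def close(self):
        self.f.close()
```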
Not sure which settings you are using; I have used multislice (to avoid the calculation of the S-matrix on the CPU) and disabled the CPU workers.
In the example above, the timing difference between the 3D and 4D output is about 5 ms per pixel ((223-34)/(191*191)), which could indeed be related to the overhead, even if this is still quite high! There is actually no need to write to file at the end of each probe; it could be done in batches (sketched after this comment), since it should be possible to fit many probes into memory.
Maybe this library could be useful: https://github.com/BlueBrain/HighFive
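A sketch of that batching idea in h5py terms (HighFive would be the C++ analogue); the batch size and dataset name are arbitrary, and it assumes probes arrive in scan order:

```python
import h5py
import numpy as np

class BatchedCBEDWriter:
    """Buffer CBED patterns in host RAM and flush them in contiguous batches."""

    def __init__(self, path, n_probes, det_shape, batch=256):
        self.f = h5py.File(path, "w")
        self.dset = self.f.create_dataset("datacube",
                                          shape=(n_probes,) + det_shape,
                                          dtype="float32")
        self.buf = np.empty((batch,) + det_shape, dtype="float32")
        self.start = 0   # probe index of the first buffered pattern
        self.count = 0   # number of patterns currently buffered

    def add(self, cbed):
        # assumes probes are delivered in scan order
        self.buf[self.count] = cbed
        self.count += 1
        if self.count == self.buf.shape[0]:
            self.flush()

    def flush(self):
        if self.count:
            # one HDF5 write per batch instead of one per probe
            self.dset[self.start:self.start + self.count] = self.buf[:self.count]
            self.start += self.count
            self.count = 0

    def close(self):
        self.flush()
        self.f.close()
```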
@lerandc Have you had a chance to look at this?
How do users monitor GPU utilization, and what parameters can be changed to maximize it?

I am lucky enough to run on a system with 4 RTX 2080 Tis, but I don't know if they are being fully utilized. In fact, nvidia-smi, which reports on the GPU state, seems to think that the GPUs are not very busy at all, under "GPU-Util". Here's a snapshot during a small simulation that takes about 5 minutes:
Is there anything I should be thinking about to improve performance?
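For the monitoring side of the question, utilization can also be polled programmatically; a minimal sketch assuming the pynvml bindings (the nvidia-ml-py package), which read the same counters nvidia-smi reports:

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:  # poll while the simulation runs in another process
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h)  # percent over the last sample window
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)
            print(f"GPU {i}: util {util.gpu}%  "
                  f"mem {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB")
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```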