
Sampling training data takes tens of minutes per epoch on Linux with an A800 #241

Open
GuohuaQiu1999 opened this issue Nov 6, 2024 · 0 comments

GuohuaQiu1999 commented Nov 6, 2024

With DECODE 0.10.2 and the same parameter.yaml file, sampling training data on Linux with an A800 is significantly slower than on Windows with a GTX 3080 Ti. Both machines use the simulation parameters below:

```yaml
Hardware:
  device: cuda:0
  device_ix: 0
  device_simulation: cuda:0
  num_worker_train: 1
  torch_multiprocessing_sharing_strategy: null
  torch_threads: 4
  unix_niceness: 0
Simulation:
  bg_uniform:
  - 40.0
  - 60.0
  density: null
  emitter_av: 250
  emitter_extent:
  - - -0.5
    - 63.5
  - - -0.5
    - 63.5
  - - -2000
    - 2000
  img_size:
  - 64
  - 64
  intensity_mu_sig:
  - 3000.0
  - 100.0
```
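
(Not part of the original report: as a quick sanity check that the `cuda:0` entries above resolve to the intended GPU on the Linux machine, the sketch below reads the same parameter.yaml with plain PyYAML. Using PyYAML directly, rather than DECODE's own parameter loader, is an assumption made for this sketch.)

```python
import torch
import yaml

# Read the parameter.yaml quoted above (plain PyYAML here; DECODE's own
# parameter loader is not used in this sketch).
with open("parameter.yaml") as f:
    param = yaml.safe_load(f)

# Confirm which physical GPU the configured simulation device maps to.
sim_device = torch.device(param["Hardware"]["device_simulation"])  # cuda:0
print(sim_device, torch.cuda.get_device_name(sim_device))
print("Simulation img_size:", param["Simulation"]["img_size"])     # [64, 64]
```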

On Windows with the GTX 3080 Ti, sampling training data takes about 8 seconds per epoch during training. On Linux with the A800, however, it takes tens of minutes (I did not wait for the sampling of a single epoch to finish because it took too long). I investigated the code, added print statements at key points, and found that it is very slow at this line:

```python
frames = self._spline_impl.forward_frames(*self.img_shape,
                                          frame_ix,
                                          n_frames,
                                          xyz_r[:, 0],
                                          xyz_r[:, 1],
                                          xyz_r[:, 2],
                                          ix[:, 0],
                                          ix[:, 1],
                                          weight)
```
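
(A minimal timing sketch, not DECODE code: because CUDA kernel launches can return before the GPU work finishes, surrounding the call with torch.cuda.synchronize() gives a more reliable measurement than bare print statements. The timed() helper and its commented usage with forward_frames are hypothetical.)

```python
import time
import torch

def timed(label, fn, *args, **kwargs):
    """Call fn once and print its wall-clock time. torch.cuda.synchronize()
    is called before and after so that any asynchronous CUDA work launched
    by fn is included in the measurement."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print(f"{label}: {time.perf_counter() - t0:.3f} s")
    return out

# Hypothetical use inside the DECODE source, wrapping the slow call above:
# frames = timed("forward_frames", self._spline_impl.forward_frames,
#                *self.img_shape, frame_ix, n_frames,
#                xyz_r[:, 0], xyz_r[:, 1], xyz_r[:, 2],
#                ix[:, 0], ix[:, 1], weight)
```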

However, nvidia-smi showed GPU utilization consistently at 100%, which is very strange. I also checked the spline library and found that it was compiled for sm_37. Could this be the reason for the performance issue? On the other hand, the sm_37-compiled code does not hurt performance on Windows with the GTX 3080 Ti. Recompiling it to test this myself is quite difficult, so I would like to ask for your help.
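
(For reference: the A800 is a compute capability 8.0, i.e. sm_80, device, while sm_37 targets the much older Kepler generation, so whether the spline binary carries PTX that the driver can JIT-compile for sm_80 is relevant here. The standard PyTorch calls below report the device and the architectures the PyTorch build itself targets; they do not inspect the separately compiled spline extension.)

```python
import torch

# What the Linux GPU reports, and which architectures this PyTorch build
# targets. Note: the spline library is a separate compiled extension, so
# torch.cuda.get_arch_list() does not describe it; this is only a reference.
print(torch.cuda.get_device_name(0))         # expected: an A800 (Ampere)
print(torch.cuda.get_device_capability(0))   # expected: (8, 0), i.e. sm_80
print("PyTorch CUDA version:", torch.version.cuda)
print("PyTorch arch list:", torch.cuda.get_arch_list())
```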
