Error when running run_evaluation_multi_uniad.sh #114

Open
Eden-Wang1710 opened this issue Oct 21, 2024 · 17 comments

@Eden-Wang1710

Eden-Wang1710 commented Oct 21, 2024

Hi, sorry for bothering you again.

When running run_evaluation_multi_uniad.sh on 8 GPUs, I encounter this error after about 2 hours:

Malloc Size=65538 LargeMemoryPoolOffset=65554 
/home/work/Bench2Drive/carla/CarlaUE4.sh: line 5:  7267 Segmentation fault      (core dumped) "$UE4_PROJECT_ROOT/CarlaUE4/Binaries/Linux/CarlaUE4-Linux-Shipping" CarlaUE4 "$@"
Signal 11 caught.
Malloc Size=65538 LargeMemoryPoolOffset=65554 
/home/work/Bench2Drive/carla/CarlaUE4.sh: line 5:  6650 Segmentation fault      (core dumped) "$UE4_PROJECT_ROOT/CarlaUE4/Binaries/Linux/CarlaUE4-Linux-Shipping" CarlaUE4 "$@"
/opt/conda/envs/b2d_zoo/lib/python3.8/site-packages/scipy/optimize/_minpack_py.py:178: RuntimeWarning: The iteration is not making good progress, as measured by the 
  improvement from the last ten iterations.
  warnings.warn(msg, RuntimeWarning)
/home/work/Bench2Drive/Bench2DriveZoo/mmcv/models/modules/transformer.py:135: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  shift = bev_queries.new_tensor(
/home/work/Bench2Drive/Bench2DriveZoo/mmcv/core/bbox/coder/detr3d_track_coder.py:94: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.post_center_range = torch.tensor(
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/opt/conda/envs/b2d_zoo/lib/python3.8/threading.py", line 932, in _bootstrap_inner
Traceback (most recent call last):
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 416, in _load_and_run_scenario
    self.run()
  File "/opt/conda/envs/b2d_zoo/lib/python3.8/threading.py", line 870, in run
    self.manager.run_scenario()
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 161, in run_scenario
    self._target(*self._args, **self._kwargs)
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 136, in build_scenarios_loop
    self._tick_scenario()
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 168, in _tick_scenario
    self.scenario.spawn_parked_vehicles(self.ego_vehicles[0])
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/route_scenario.py", line 236, in spawn_parked_vehicles
    CarlaDataProvider.get_world().tick(self._timeout)
RuntimeError: time-out of 600000ms while waiting for the simulator, make sure the simulator is ready and connected to localhost:40900

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 568, in <module>
    main()
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 558, in main
    for response in CarlaDataProvider.get_client().apply_batch_sync(new_parked_vehicles):
RuntimeError: time-out of 600000ms while waiting for the simulator, make sure the simulator is ready and connected to localhost:40900
    crashed = leaderboard_evaluator.run(arguments)
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 481, in run
    crashed = self._load_and_run_scenario(args, config)
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 416, in _load_and_run_scenario
    self.manager.run_scenario()
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 146, in _signal_handler
    self.manager.signal_handler(signum, frame)
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 89, in signal_handler
    raise RuntimeError("The simulation took longer than {}s to update".format(self._timeout))
RuntimeError: The simulation took longer than 600.0s to update
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/opt/conda/envs/b2d_zoo/lib/python3.8/threading.py", line 932, in _bootstrap_inner
Traceback (most recent call last):
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 416, in _load_and_run_scenario
    self.run()
  File "/opt/conda/envs/b2d_zoo/lib/python3.8/threading.py", line 870, in run
    self.manager.run_scenario()
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 161, in run_scenario
    self._target(*self._args, **self._kwargs)
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 136, in build_scenarios_loop
    self._tick_scenario()
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 168, in _tick_scenario
    self.scenario.spawn_parked_vehicles(self.ego_vehicles[0])
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/route_scenario.py", line 236, in spawn_parked_vehicles
    CarlaDataProvider.get_world().tick(self._timeout)
RuntimeError: time-out of 600000ms while waiting for the simulator, make sure the simulator is ready and connected to localhost:40750

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 568, in <module>
    for response in CarlaDataProvider.get_client().apply_batch_sync(new_parked_vehicles):
RuntimeError: time-out of 600000ms while waiting for the simulator, make sure the simulator is ready and connected to localhost:40750
    main()
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 558, in main
    crashed = leaderboard_evaluator.run(arguments)
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 481, in run
    crashed = self._load_and_run_scenario(args, config)
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 432, in _load_and_run_scenario
    print("\n\033[91mError during the simulation:", flush=True)
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 146, in _signal_handler
    self.manager.signal_handler(signum, frame)
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 89, in signal_handler
    raise RuntimeError("The simulation took longer than {}s to update".format(self._timeout))
RuntimeError: The simulation took longer than 600.0s to update

At the same time, I find that the memory usage of some GPUs drops; see GPU 5 and 6:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3F:00.0 Off |                    0 |
| N/A   43C    P0    70W / 300W |   9399MiB / 32768MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:40:00.0 Off |                    0 |
| N/A   41C    P0    68W / 300W |   9097MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   57C    P0   244W / 300W |   8067MiB / 32768MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:42:00.0 Off |                    0 |
| N/A   48C    P0    71W / 300W |  10219MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   51C    P0   264W / 300W |  10344MiB / 32768MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:63:00.0 Off |                    0 |
| N/A   35C    P0    55W / 300W |   4344MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:64:00.0 Off |                    0 |
| N/A   42C    P0    59W / 300W |   3828MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   49C    P0    72W / 300W |   8732MiB / 32768MiB |     36%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Also, only 8 json files are recorded, and none of them is being updated any more. I guess something goes wrong when loading a new route after finishing one?
[Screenshot 2024-10-21 16:22:44 of the recorded json files]

@jayyoung0802
Member

jayyoung0802 commented Oct 21, 2024

Hi @Eden-Wang1710, Carla doesn't work well. Please refer to #32 and #111.

@Eden-Wang1710
Author

Carla doesn't work well. Please refer to #32.

Hi, in my case it has already been running for two hours; 4 of the 8 json files show as complete, and there are already 12 image-saving folders, all with images saved in them.
I'm using V100s, not A800s, and when I run the debug script on its own there is no problem at all.

@jayyoung0802
Member

Good, just resume it, please refer to #89.

@Eden-Wang1710
Author

Good, just resume it, please refer to #89.

Thanks. When running on 8 GPUs, will the results be saved into 8 json files or into 220 json files?

@jayyoung0802
Member

8 jsons, 220 will be divided into 8 equal parts.
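
For reference, a minimal sketch of how 220 routes can be divided into 8 near-equal parts is shown below; it only illustrates the idea, the authoritative implementation is split_xml.py.

# Minimal sketch: divide the route list into num_tasks near-equal chunks.
# Illustrative only; in Bench2Drive the actual splitting is done by split_xml.py.
def split_routes(route_ids, num_tasks):
    base, extra = divmod(len(route_ids), num_tasks)
    chunks, start = [], 0
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)
        chunks.append(route_ids[start:start + size])
        start += size
    return chunks

print([len(part) for part in split_routes(list(range(220)), 8)])
# -> [28, 28, 28, 28, 27, 27, 27, 27]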

@Eden-Wang1710
Author

8 jsons, 220 will be divided into 8 equal parts.

Thanks. Does this progress field mean, for each GPU, how many routes it has completed and how many routes it was assigned?
[Screenshot of the progress field in the result json]

@jayyoung0802
Member

Yes, you are right.
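
If it helps, the per-GPU progress can be summed with a hedged sketch like the one below; it assumes each result json follows the CARLA-leaderboard checkpoint layout, where data["_checkpoint"]["progress"] is [completed, assigned], and the glob pattern is a placeholder.

import glob
import json

# Hedged sketch: add up the progress reported by each per-GPU result json.
# Assumes the CARLA-leaderboard checkpoint layout, i.e.
# data["_checkpoint"]["progress"] == [completed, assigned]; adjust the keys
# and the glob pattern to match the actual output files.
completed_total, assigned_total = 0, 0
for path in sorted(glob.glob("*.json")):  # placeholder pattern for the 8 result jsons
    with open(path) as f:
        data = json.load(f)
    completed, assigned = data["_checkpoint"]["progress"]
    print(f"{path}: {completed}/{assigned} routes")
    completed_total += completed
    assigned_total += assigned
print(f"all GPUs: {completed_total}/{assigned_total} routes")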

@Eden-Wang1710
Author

Eden-Wang1710 commented Oct 21, 2024

Yes, you are right.

Weird, my eight GPUs add up to 148 routes.

@jayyoung0802
Member

split_xml.py is very clear and concise. Please check split_xml.py.

@Eden-Wang1710
Author

Eden-Wang1710 commented Oct 22, 2024

split_xml.py is very clear and concise. Please check split_xml.py.

Thanks! But why is the task number directly set to 12 here? In my understanding, it should follow TASK_LIST.
https://github.com/Thinklab-SJTU/Bench2Drive/blob/main/leaderboard/scripts/run_evaluation_multi_uniad.sh#L25

Also, you recommend a GPU:Task ratio of 1:2, but the current code here (which allocates the routes to each GPU) seems to use 1:1 by default. Do we need to modify it if we want to run 2 tasks on each GPU?
https://github.com/Thinklab-SJTU/Bench2Drive/blob/main/leaderboard/scripts/run_evaluation_multi_uniad.sh#L45
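
For context, what I have in mind is just a modulo mapping from task index to GPU index, something like the sketch below (illustrative only; the names are not variables from the script and this is not its actual allocation logic):

# Sketch only: map 16 tasks onto 8 GPUs (GPU:Task = 1:2) via modulo.
# The real allocation lives in run_evaluation_multi_uniad.sh; the names
# below are illustrative, not variables taken from that script.
NUM_GPUS = 8
NUM_TASKS = 16

for task_id in range(NUM_TASKS):
    gpu_id = task_id % NUM_GPUS
    print(f"task {task_id:2d} -> CUDA_VISIBLE_DEVICES={gpu_id}")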

Thanks again for your kind help!

@jayyoung0802
Member

@Eden-Wang1710 Thanks for finding the typo. The task number should be 8 and equal to len(TASK_LIST). I will fix it in the next update.

@Eden-Wang1710
Author

Eden-Wang1710 commented Oct 23, 2024

Hi, after resuming run_evaluation_multi_uniad.sh, some GPUs run well and finish their tasks.
But the last three GPUs (indices 5, 6, 7), although they run for a while, still hit this error, and they fail to finish even their first route.
Could you please tell me the reason for this error? Is it caused by a CARLA disconnection, or by the agent's bad behavior triggering bugs? Thanks!

GPU 5:

=== [Agent] -- Wallclock = 2024-10-22 20:57:10.791 -- System time = 3476.614 -- Game time = 68.050 -- Ratio = 0.020x
=== [Agent] -- Wallclock = 2024-10-22 20:57:13.365 -- System time = 3479.188 -- Game time = 68.100 -- Ratio = 0.020x
=== [Agent] -- Wallclock = 2024-10-22 20:57:16.011 -- System time = 3481.834 -- Game time = 68.150 -- Ratio = 0.020x
CommonUnixCrashHandler: Signal=11
Engine crash handling finished; re-raising signal 11 for the default handler. Good bye.

Error during the simulation:
Watchdog exception - Timeout of 601.0 seconds occured

GPU 6:

=== [Agent] -- Wallclock = 2024-10-22 21:37:06.721 -- System time = 5890.964 -- Game time = 96.800 -- Ratio = 0.016x
CommonUnixCrashHandler: Signal=11
Engine crash handling finished; re-raising signal 11 for the default handler. Good bye.

Error during the simulation:
Watchdog exception - Timeout of 601.0 seconds occured

GPU 7:

=== [Agent] -- Wallclock = 2024-10-22 22:55:56.980 -- System time = 10550.552 -- Game time = 149.650 -- Ratio = 0.014x
=== [Agent] -- Wallclock = 2024-10-22 22:56:00.573 -- System time = 10554.146 -- Game time = 149.700 -- Ratio = 0.014x
CommonUnixCrashHandler: Signal=11
Engine crash handling finished; re-raising signal 11 for the default handler. Good bye.
Watchdog exception - Timeout of 601.0 seconds occured

@jayyoung0802
Member

jayyoung0802 commented Oct 23, 2024

You can comment out these crashed routes in the xml file and then resume, see #89. These routes may have crashed because of the agent's behavior.
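
If there are many of them, a minimal sketch of one way to drop the crashed routes from a copy of the route file before resuming is shown below; it assumes the leaderboard-style format where each route is a top-level <route id="..."> element, and the file names and ids are placeholders.

import xml.etree.ElementTree as ET

# Hedged sketch: remove (rather than comment out) the crashed routes from a
# copy of the routes XML before resuming. Assumes the leaderboard-style
# format where each route is a top-level <route id="..."> element; the
# file names and the crashed-id list below are placeholders.
CRASHED_IDS = {"24", "25"}          # ids of the routes to skip (example values)
SRC = "routes_split_5.xml"          # placeholder per-GPU route file
DST = "routes_split_5_resume.xml"

tree = ET.parse(SRC)
root = tree.getroot()
for route in list(root.findall("route")):
    if route.get("id") in CRASHED_IDS:
        root.remove(route)
tree.write(DST, encoding="utf-8", xml_declaration=True)
print(f"wrote {DST} without routes {sorted(CRASHED_IDS)}")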

@Eden-Wang1710
Author

You can comment out these crashed routes in the xml file and then resume, see #89. These routes may have crashed because of the agent's behavior.

Yes, I checked the saved images; the route crashed because of the agent's behavior. Now I'm trying to modify the script so that it can automatically resume the program and skip the crashed route, since it's cumbersome to resume and comment manually.
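
What I have in mind is roughly a wrapper like the sketch below, which simply relaunches the evaluation command after a crash; the command, retry limit, and wait time are placeholders, and the relaunched run still has to resume from its checkpoint json as in #89.

import subprocess
import time

# Rough sketch of an auto-resume wrapper: relaunch the evaluation command
# whenever it exits with a non-zero code, up to MAX_RETRIES times. The
# command is a placeholder; point it at the actual per-GPU evaluation
# invocation, which must itself resume from its checkpoint json.
CMD = ["bash", "my_run_evaluation.sh"]  # placeholder command
MAX_RETRIES = 10

for attempt in range(1, MAX_RETRIES + 1):
    print(f"[wrapper] attempt {attempt}/{MAX_RETRIES}")
    returncode = subprocess.run(CMD).returncode
    if returncode == 0:
        print("[wrapper] evaluation finished cleanly")
        break
    print(f"[wrapper] exited with code {returncode}; retrying in 30 s")
    time.sleep(30)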

@jayyoung0802
Member

For safety reasons, we operate them manually.

@Eden-Wang1710
Author

For safety reasons, we operate them manually.

Roughly how many resume-and-comment operations do you usually need for one model?

@jayyoung0802
Member

Our eval json files are open source; please check them.
