Error when running run_evaluation_multi_uniad.sh #114

Open
Eden-Wang1710 opened this issue Oct 21, 2024 · 17 comments

@Eden-Wang1710

Eden-Wang1710 commented Oct 21, 2024

Hi, sorry for bothering you again.

When running run_evaluation_multi_uniad.sh on 8 GPUs, I encounter this error after about 2 hours:

Malloc Size=65538 LargeMemoryPoolOffset=65554 
/home/work/Bench2Drive/carla/CarlaUE4.sh: line 5:  7267 Segmentation fault      (core dumped) "$UE4_PROJECT_ROOT/CarlaUE4/Binaries/Linux/CarlaUE4-Linux-Shipping" CarlaUE4 "$@"
Signal 11 caught.
Malloc Size=65538 LargeMemoryPoolOffset=65554 
/home/work/Bench2Drive/carla/CarlaUE4.sh: line 5:  6650 Segmentation fault      (core dumped) "$UE4_PROJECT_ROOT/CarlaUE4/Binaries/Linux/CarlaUE4-Linux-Shipping" CarlaUE4 "$@"
/opt/conda/envs/b2d_zoo/lib/python3.8/site-packages/scipy/optimize/_minpack_py.py:178: RuntimeWarning: The iteration is not making good progress, as measured by the 
  improvement from the last ten iterations.
  warnings.warn(msg, RuntimeWarning)
/home/work/Bench2Drive/Bench2DriveZoo/mmcv/models/modules/transformer.py:135: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:278.)
  shift = bev_queries.new_tensor(
/home/work/Bench2Drive/Bench2DriveZoo/mmcv/core/bbox/coder/detr3d_track_coder.py:94: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.post_center_range = torch.tensor(
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/opt/conda/envs/b2d_zoo/lib/python3.8/threading.py", line 932, in _bootstrap_inner
Traceback (most recent call last):
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 416, in _load_and_run_scenario
    self.run()
  File "/opt/conda/envs/b2d_zoo/lib/python3.8/threading.py", line 870, in run
    self.manager.run_scenario()
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 161, in run_scenario
    self._target(*self._args, **self._kwargs)
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 136, in build_scenarios_loop
    self._tick_scenario()
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 168, in _tick_scenario
    self.scenario.spawn_parked_vehicles(self.ego_vehicles[0])
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/route_scenario.py", line 236, in spawn_parked_vehicles
    CarlaDataProvider.get_world().tick(self._timeout)
RuntimeError: time-out of 600000ms while waiting for the simulator, make sure the simulator is ready and connected to localhost:40900

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 568, in <module>
    main()
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 558, in main
    for response in CarlaDataProvider.get_client().apply_batch_sync(new_parked_vehicles):
RuntimeError: time-out of 600000ms while waiting for the simulator, make sure the simulator is ready and connected to localhost:40900
    crashed = leaderboard_evaluator.run(arguments)
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 481, in run
    crashed = self._load_and_run_scenario(args, config)
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 416, in _load_and_run_scenario
    self.manager.run_scenario()
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 146, in _signal_handler
    self.manager.signal_handler(signum, frame)
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 89, in signal_handler
    raise RuntimeError("The simulation took longer than {}s to update".format(self._timeout))
RuntimeError: The simulation took longer than 600.0s to update
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/opt/conda/envs/b2d_zoo/lib/python3.8/threading.py", line 932, in _bootstrap_inner
Traceback (most recent call last):
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 416, in _load_and_run_scenario
    self.run()
  File "/opt/conda/envs/b2d_zoo/lib/python3.8/threading.py", line 870, in run
    self.manager.run_scenario()
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 161, in run_scenario
    self._target(*self._args, **self._kwargs)
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 136, in build_scenarios_loop
    self._tick_scenario()
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 168, in _tick_scenario
    self.scenario.spawn_parked_vehicles(self.ego_vehicles[0])
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/route_scenario.py", line 236, in spawn_parked_vehicles
    CarlaDataProvider.get_world().tick(self._timeout)
RuntimeError: time-out of 600000ms while waiting for the simulator, make sure the simulator is ready and connected to localhost:40750

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 568, in <module>
    for response in CarlaDataProvider.get_client().apply_batch_sync(new_parked_vehicles):
RuntimeError: time-out of 600000ms while waiting for the simulator, make sure the simulator is ready and connected to localhost:40750
    main()
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 558, in main
    crashed = leaderboard_evaluator.run(arguments)
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 481, in run
    crashed = self._load_and_run_scenario(args, config)
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 432, in _load_and_run_scenario
    print("\n\033[91mError during the simulation:", flush=True)
  File "leaderboard/leaderboard/leaderboard_evaluator.py", line 146, in _signal_handler
    self.manager.signal_handler(signum, frame)
  File "/home/work/Bench2Drive/leaderboard/leaderboard/scenarios/scenario_manager.py", line 89, in signal_handler
    raise RuntimeError("The simulation took longer than {}s to update".format(self._timeout))
RuntimeError: The simulation took longer than 600.0s to update

At the same time, I find that the memory usage of some GPUs drops; see GPU 5 and 6:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3F:00.0 Off |                    0 |
| N/A   43C    P0    70W / 300W |   9399MiB / 32768MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:40:00.0 Off |                    0 |
| N/A   41C    P0    68W / 300W |   9097MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   57C    P0   244W / 300W |   8067MiB / 32768MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:42:00.0 Off |                    0 |
| N/A   48C    P0    71W / 300W |  10219MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   51C    P0   264W / 300W |  10344MiB / 32768MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:63:00.0 Off |                    0 |
| N/A   35C    P0    55W / 300W |   4344MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:64:00.0 Off |                    0 |
| N/A   42C    P0    59W / 300W |   3828MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   49C    P0    72W / 300W |   8732MiB / 32768MiB |     36%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Also, only 8 json files are recorded, and none of them is being updated any more. I guess something goes wrong when loading a new route after finishing one?
[Screenshot 2024-10-21 16:22:44 of the recorded json files]

@jayyoung0802
Member

jayyoung0802 commented Oct 21, 2024

Hi @Eden-Wang1710, Carla doesn't work well. Please refer to #32 and #111.

@Eden-Wang1710
Author

Carla doesn't work well. Please refer to #32.

Hi, in my case it has already been running for two hours; 4 of the 8 json files show as complete, and there are already 12 image-saving folders, all with images saved in them.
I'm using V100s, not A800s, and when I run the debug script on its own there is no problem at all.

@jayyoung0802
Member

Good, just resume it, please refer to #89.

@Eden-Wang1710
Author

Good, just resume it, please refer to #89.

Thanks. When running on 8 GPUs, will the results be saved into 8 json files or into 220 json files?

@jayyoung0802
Member

8 jsons, 220 will be divided into 8 equal parts.
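
For reference, a minimal sketch of how 220 routes can be divided into 8 near-equal parts is shown below; it only illustrates the idea, the authoritative implementation is split_xml.py.

# Minimal sketch: divide the route list into num_tasks near-equal chunks.
# Illustrative only; in Bench2Drive the actual splitting is done by split_xml.py.
def split_routes(route_ids, num_tasks):
    base, extra = divmod(len(route_ids), num_tasks)
    chunks, start = [], 0
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)
        chunks.append(route_ids[start:start + size])
        start += size
    return chunks

print([len(part) for part in split_routes(list(range(220)), 8)])
# -> [28, 28, 28, 28, 27, 27, 27, 27]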

@Eden-Wang1710
Author

8 jsons, 220 will be divided into 8 equal parts.

Thanks. Does this progress field mean, for each GPU, how many routes it has completed and how many routes it was assigned?
[Screenshot of the progress field in the result json]

@jayyoung0802
Member

Yes, you are right.
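
If it helps, the per-GPU progress can be summed with a hedged sketch like the one below; it assumes each result json follows the CARLA-leaderboard checkpoint layout, where data["_checkpoint"]["progress"] is [completed, assigned], and the glob pattern is a placeholder.

import glob
import json

# Hedged sketch: add up the progress reported by each per-GPU result json.
# Assumes the CARLA-leaderboard checkpoint layout, i.e.
# data["_checkpoint"]["progress"] == [completed, assigned]; adjust the keys
# and the glob pattern to match the actual output files.
completed_total, assigned_total = 0, 0
for path in sorted(glob.glob("*.json")):  # placeholder pattern for the 8 result jsons
    with open(path) as f:
        data = json.load(f)
    completed, assigned = data["_checkpoint"]["progress"]
    print(f"{path}: {completed}/{assigned} routes")
    completed_total += completed
    assigned_total += assigned
print(f"all GPUs: {completed_total}/{assigned_total} routes")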

@Eden-Wang1710
Author

Eden-Wang1710 commented Oct 21, 2024

Yes, you are right.

Weird, my eight GPUs add up to 148 routes.

@jayyoung0802
Member

split_xml.py is very clear and concise. Please check split_xml.py.

@Eden-Wang1710
Author

Eden-Wang1710 commented Oct 22, 2024

split_xml.py is very clear and concise. Please check split_xml.py.

Thanks! But why is the task number directly set to 12 here? In my understanding, it should follow TASK_LIST.
https://github.com/Thinklab-SJTU/Bench2Drive/blob/main/leaderboard/scripts/run_evaluation_multi_uniad.sh#L25

Also, you recommend a GPU:Task ratio of 1:2, but the current code here (which allocates the routes to each GPU) seems to use 1:1 by default. Do we need to modify it if we want to run 2 tasks on each GPU?
https://github.com/Thinklab-SJTU/Bench2Drive/blob/main/leaderboard/scripts/run_evaluation_multi_uniad.sh#L45
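
For context, what I have in mind is just a modulo mapping from task index to GPU index, something like the sketch below (illustrative only; the names are not variables from the script and this is not its actual allocation logic):

# Sketch only: map 16 tasks onto 8 GPUs (GPU:Task = 1:2) via modulo.
# The real allocation lives in run_evaluation_multi_uniad.sh; the names
# below are illustrative, not variables taken from that script.
NUM_GPUS = 8
NUM_TASKS = 16

for task_id in range(NUM_TASKS):
    gpu_id = task_id % NUM_GPUS
    print(f"task {task_id:2d} -> CUDA_VISIBLE_DEVICES={gpu_id}")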

Thanks again for your kind help!

@jayyoung0802
Member

@Eden-Wang1710 Thanks for finding the typo. The task number should be 8 and equal to len(TASK_LIST). I will fix it in the next update.

@Eden-Wang1710
Author

Eden-Wang1710 commented Oct 23, 2024

Hi, after resuming run_evaluation_multi_uniad.sh, some GPUs run well and finish their tasks.
But the last three GPUs (indices 5, 6, 7), although they run for a while, still hit this error, and they fail to finish even their first route.
Could you please tell me the reason for this error? Is it caused by a CARLA disconnection, or by the agent's bad behavior triggering bugs? Thanks!

GPU 5:

=== [Agent] -- Wallclock = 2024-10-22 20:57:10.791 -- System time = 3476.614 -- Game time = 68.050 -- Ratio = 0.020x
=== [Agent] -- Wallclock = 2024-10-22 20:57:13.365 -- System time = 3479.188 -- Game time = 68.100 -- Ratio = 0.020x
=== [Agent] -- Wallclock = 2024-10-22 20:57:16.011 -- System time = 3481.834 -- Game time = 68.150 -- Ratio = 0.020x
CommonUnixCrashHandler: Signal=11
Engine crash handling finished; re-raising signal 11 for the default handler. Good bye.

Error during the simulation:
Watchdog exception - Timeout of 601.0 seconds occured

GPU 6:

=== [Agent] -- Wallclock = 2024-10-22 21:37:06.721 -- System time = 5890.964 -- Game time = 96.800 -- Ratio = 0.016x
CommonUnixCrashHandler: Signal=11
Engine crash handling finished; re-raising signal 11 for the default handler. Good bye.

Error during the simulation:
Watchdog exception - Timeout of 601.0 seconds occured

GPU 7:

=== [Agent] -- Wallclock = 2024-10-22 22:55:56.980 -- System time = 10550.552 -- Game time = 149.650 -- Ratio = 0.014x
=== [Agent] -- Wallclock = 2024-10-22 22:56:00.573 -- System time = 10554.146 -- Game time = 149.700 -- Ratio = 0.014x
CommonUnixCrashHandler: Signal=11
Engine crash handling finished; re-raising signal 11 for the default handler. Good bye.
Watchdog exception - Timeout of 601.0 seconds occured

@jayyoung0802
Member

jayyoung0802 commented Oct 23, 2024

You can comment out these crashed routes in the xml file and then resume, see #89. These routes may have crashed because of the agent's behavior.
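
If there are many of them, a minimal sketch of one way to drop the crashed routes from a copy of the route file before resuming is shown below; it assumes the leaderboard-style format where each route is a top-level <route id="..."> element, and the file names and ids are placeholders.

import xml.etree.ElementTree as ET

# Hedged sketch: remove (rather than comment out) the crashed routes from a
# copy of the routes XML before resuming. Assumes the leaderboard-style
# format where each route is a top-level <route id="..."> element; the
# file names and the crashed-id list below are placeholders.
CRASHED_IDS = {"24", "25"}          # ids of the routes to skip (example values)
SRC = "routes_split_5.xml"          # placeholder per-GPU route file
DST = "routes_split_5_resume.xml"

tree = ET.parse(SRC)
root = tree.getroot()
for route in list(root.findall("route")):
    if route.get("id") in CRASHED_IDS:
        root.remove(route)
tree.write(DST, encoding="utf-8", xml_declaration=True)
print(f"wrote {DST} without routes {sorted(CRASHED_IDS)}")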

@Eden-Wang1710
Author

You can comment out these crashed routes in the xml file and then resume, see #89. These routes may have crashed because of the agent's behavior.

Yes, I checked the saved images; the route crashed because of the agent's behavior. Now I'm trying to modify the script so that it can automatically resume the program and skip the crashed route, since it's cumbersome to resume and comment manually.
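
What I have in mind is roughly a wrapper like the sketch below, which simply relaunches the evaluation command after a crash; the command, retry limit, and wait time are placeholders, and the relaunched run still has to resume from its checkpoint json as in #89.

import subprocess
import time

# Rough sketch of an auto-resume wrapper: relaunch the evaluation command
# whenever it exits with a non-zero code, up to MAX_RETRIES times. The
# command is a placeholder; point it at the actual per-GPU evaluation
# invocation, which must itself resume from its checkpoint json.
CMD = ["bash", "my_run_evaluation.sh"]  # placeholder command
MAX_RETRIES = 10

for attempt in range(1, MAX_RETRIES + 1):
    print(f"[wrapper] attempt {attempt}/{MAX_RETRIES}")
    returncode = subprocess.run(CMD).returncode
    if returncode == 0:
        print("[wrapper] evaluation finished cleanly")
        break
    print(f"[wrapper] exited with code {returncode}; retrying in 30 s")
    time.sleep(30)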

@jayyoung0802
Member

For safety reasons, we operate them manually.

@Eden-Wang1710
Author

For safety reasons, we operate them manually.

Roughly how many resume-and-comment operations do you usually need for one model?

@jayyoung0802
Member

Our eval json files are open source; please check them.
