Running the maddpg benchmark for a long time results in a traci error. #158

Closed
rsuwa opened this issue Nov 13, 2020 · 3 comments · May be fixed by #619

Comments

rsuwa commented Nov 13, 2020

Reproduction

If you run the task for a long time (1-2 hours), there is a high probability that the following error will occur.

Command

python run.py scenarios/intersections/4lane -f agents/maddpg/baseline-lane-control.yaml

Full logs

Failure # 1 (occurred at 2020-11-13_14-32-54)
Traceback (most recent call last):
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 726, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 489, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/worker.py", line 1452, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(FatalTraCIError): ray::MADDPG2.train() (pid=8802, ip=192.168.10.106)
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 517, in train
    raise e
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 506, in train
    result = Trainable.train(self)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/tune/trainable.py", line 336, in train
    result = self.step()
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 147, in step
    res = next(self.train_exec_impl)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 1075, in build_union
    item = next(it)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  [Previous line repeated 1 more time]
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 471, in base_iterator
    yield ray.get(futures, timeout=timeout)
ray.exceptions.RayTaskError(FatalTraCIError): ray::RolloutWorker.par_iter_next() (pid=8801, ip=192.168.10.106)
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/util/iter.py", line 1152, in par_iter_next
    return next(self.local_it)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 317, in gen_rollouts
    yield self.sample()
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 621, in sample
    batches = [self.input_reader.next()]
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 94, in next
    batches = [self.get_data()]
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 211, in get_data
    item = next(self.rollout_provider)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 602, in _env_runner
    observation_fn=observation_fn,
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/evaluation/sampler.py", line 896, in _process_observations
    env_id)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/env/base_env.py", line 422, in try_reset
    obs = self.env_states[env_id].reset()
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/.venv/lib/python3.7/site-packages/ray/rllib/env/base_env.py", line 460, in reset
    self.last_obs = self.env.reset()
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/benchmark/wrappers/rllib/early_done.py", line 34, in reset
    obs = self.env.reset()
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/smarts/env/rllib_hiway_env.py", line 158, in reset
    env_observations = self._smarts.reset(scenario)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/smarts/core/smarts.py", line 270, in reset
    self.setup(scenario)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/smarts/core/smarts.py", line 318, in setup
    provider_state = self._setup_providers(self._scenario)
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/smarts/core/smarts.py", line 600, in _setup_providers
    provider_state.merge(provider.setup(scenario))
  File "/home/ryota/src/github.com/huawei-noah/SMARTS/smarts/core/sumo_traffic_simulation.py", line 238, in setup
    [tc.VAR_DEPARTED_VEHICLES_IDS, tc.VAR_ARRIVED_VEHICLES_IDS]
  File "/usr/share/sumo/tools/traci/_simulation.py", line 440, in subscribe
    Domain.subscribe(self, "", varIDs, begin, end)
  File "/usr/share/sumo/tools/traci/domain.py", line 208, in subscribe
    self._connection._subscribe(self._subscribeID, begin, end, objectID, varIDs)
  File "/usr/share/sumo/tools/traci/connection.py", line 231, in _subscribe
    result = self._sendCmd(cmdID, (begin, end), objID, format, *args)
  File "/usr/share/sumo/tools/traci/connection.py", line 178, in _sendCmd
    return self._sendExact()
  File "/usr/share/sumo/tools/traci/connection.py", line 88, in _sendExact
    raise FatalTraCIError("connection closed by SUMO")
traci.exceptions.FatalTraCIError: connection closed by SUMO
@KornbergFresnel
Contributor

@rsuwa Yes, this error is produced by SUMO. You can read related issue reports like this one for more details. It is worth noting that this problem will not affect your training. You can set a large max_failures and decrease the number of workers, which ensures that plenty of samples can still be collected and also reduces the probability of a TraCIError being raised:

from ray import tune

analysis = tune.run(
    "PG",
    # ...
    max_failures=3,
    # ...
)
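A slightly fuller sketch of the suggestion above, assuming Tune's `max_failures` argument and RLlib's `num_workers` config key. The helper name `build_tune_kwargs` and the values chosen are hypothetical and are not taken from the SMARTS benchmark code.

```python
def build_tune_kwargs(max_failures=10, num_workers=2):
    """Assemble keyword arguments for tune.run (hypothetical helper)."""
    if num_workers < 0:
        raise ValueError("num_workers must be non-negative")
    return {
        # How many times Tune may try to recover a failed trial.
        "max_failures": max_failures,
        # Fewer rollout workers means fewer SUMO connections that can drop.
        "config": {"num_workers": num_workers},
    }


kwargs = build_tune_kwargs(max_failures=10, num_workers=2)

# With Ray installed, the result would be passed on as:
#   from ray import tune
#   analysis = tune.run("PG", **kwargs)
```

Keeping the arguments in a plain dict makes it easy to adjust the two values per scenario without touching the call to `tune.run`.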

Author

rsuwa commented Nov 14, 2020

@KornbergFresnel
Okay, I'll check it out.
However, once the error occurs, I can't get the run to recover.
It also appears to reset the number of steps.

Screenshot 2020-11-14 14:02:33
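One possible explanation for the step count resetting is that no checkpoint existed when the trial was restarted: Tune recovers a failed trial from its latest checkpoint if one is present, and otherwise starts over. A minimal sketch, assuming periodic checkpointing through Tune's `checkpoint_freq` and `checkpoint_at_end` arguments is acceptable for this benchmark. The helper name `with_checkpointing` and the interval are hypothetical.

```python
def with_checkpointing(tune_kwargs, every_n_iters=10):
    """Return a copy of tune.run kwargs with periodic checkpointing enabled."""
    if every_n_iters <= 0:
        raise ValueError("every_n_iters must be positive")
    updated = dict(tune_kwargs)  # leave the caller's dict untouched
    updated["checkpoint_freq"] = every_n_iters  # save every N training iterations
    updated["checkpoint_at_end"] = True         # also save when the trial finishes
    return updated


base = {"max_failures": 3}
kwargs = with_checkpointing(base, every_n_iters=10)

# With Ray installed, the result would be passed on as:
#   from ray import tune
#   analysis = tune.run("PG", **kwargs)
```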

Gamenot added this to the Backlog milestone Jan 27, 2021
Adaickalavan linked a pull request Mar 18, 2021 that will close this issue
@Adaickalavan
Member

Given the graceful handling of TraCI connection errors in the latest SMARTS version, this issue is being closed.
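
The SMARTS change itself is not shown in this thread. As a generic illustration only, graceful handling of a dropped simulator connection could look like the retry loop below. `SimulatorConnectionError`, `reset_with_retry`, and the `reconnect` callback are hypothetical stand-ins (for example, for `traci.exceptions.FatalTraCIError` and a routine that relaunches SUMO), not SMARTS APIs.

```python
class SimulatorConnectionError(Exception):
    """Hypothetical stand-in for traci.exceptions.FatalTraCIError."""


def reset_with_retry(reset, reconnect, max_attempts=3):
    """Call reset(); on a connection error, reconnect and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return reset()
        except SimulatorConnectionError as err:
            last_error = err
            reconnect()  # e.g. restart the simulator process
    # Every attempt failed: surface the last connection error.
    raise last_error


# Usage with a fake environment whose first two resets fail.
calls = {"reset": 0, "reconnect": 0}


def flaky_reset():
    calls["reset"] += 1
    if calls["reset"] < 3:
        raise SimulatorConnectionError("connection closed by SUMO")
    return "observation"


def reconnect():
    calls["reconnect"] += 1


result = reset_with_retry(flaky_reset, reconnect)
```

Bounding the number of attempts keeps a permanently broken simulator from looping forever, while a transient disconnect is absorbed without failing the whole trial.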
