Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graceful handle of TraCI connection errors #1138

Merged
merged 29 commits into from
Jan 6, 2022

Conversation

Gamenot
Copy link
Collaborator

@Gamenot Gamenot commented Nov 25, 2021

This is an attempt to gracefully handle TraCI connection errors until the root issue is fixed.

See(for example):
#158
#295
#803

@Gamenot Gamenot changed the title Graceful handle of traci connection errors Graceful handle of TraCI connection errors Nov 25, 2021
@Adaickalavan
Copy link
Member

What is the desired course of action when TraCI error happens (e.g., terminate, restart connection)?

@Gamenot
Copy link
Collaborator Author

Gamenot commented Nov 26, 2021

What is the desired course of action when TraCI error happens (e.g., terminate, restart connection)?

@Adaickalavan From conversation, the intention would be to gracefully end the episode.

Copy link
Contributor

@RutvikGupta RutvikGupta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be now rebased on develop since its an important change.

@Gamenot Gamenot changed the base branch from marl_benchmark_path_fixes to develop December 1, 2021 21:15
smarts/core/provider.py Outdated Show resolved Hide resolved
smarts/core/provider.py Outdated Show resolved Hide resolved
smarts/core/provider.py Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
smarts/core/provider.py Outdated Show resolved Hide resolved
smarts/core/sumo_traffic_simulation.py Outdated Show resolved Hide resolved
smarts/core/sumo_traffic_simulation.py Outdated Show resolved Hide resolved
smarts/core/provider.py Outdated Show resolved Hide resolved
smarts/core/provider.py Outdated Show resolved Hide resolved
@Gamenot Gamenot force-pushed the bugfix-gracefully_handle_traci branch from ee706bd to 76fe450 Compare December 1, 2021 22:39
@Gamenot Gamenot requested a review from sah-huawei December 2, 2021 03:13
@Gamenot
Copy link
Collaborator Author

Gamenot commented Dec 11, 2021

Odd that the notebook test failed since it worked locally. I know the github workers seem to be having issues.

Comment on lines +216 to +218
if sim.should_reset:
dones = {agent_id: True for agent_id in self.agent_ids}
dones["__sim__"] = True
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is that we could treat the sim as also having the possibility to be done.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can come through the dones from SMARTS; however, at the gym level it would need to be translated.

Comment on lines +566 to +556
if len(route) > 0:
goal = PositionalGoal.from_road(route[-1], sim.scenario.road_map)
else:
goal = EndlessGoal()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was a bug with a 0 length route being considered a positional goal when there is no end edge.

Comment on lines +128 to +131
recovery_flags=ProviderRecoveryFlags.EPISODE_REQUIRED
| ProviderRecoveryFlags.ATTEMPT_RECOVERY,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to choose if the provider will attempt to recover from an error by specifying flags when adding the provider to the simulation.

Comment on lines 390 to 397
def add_provider(
self,
provider: Provider,
recovery_flags: ProviderRecoveryFlags = ProviderRecoveryFlags.EXPERIMENT_REQUIRED,
):
assert isinstance(provider, Provider)
self._providers.append(provider)
self._provider_recovery_flags[provider] = recovery_flags
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The simulation keeps track of if the provider should attempt recovery.

Comment on lines +830 to +885
recovery_flags = self._provider_recovery_flags.get(
provider, ProviderRecoveryFlags.EXPERIMENT_REQUIRED
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The provider will default to being required in the simulation.

Comment on lines 840 to 902
elif recovery_flags & ProviderRecoveryFlags.EXPERIMENT_REQUIRED:
raise provider_error
Copy link
Collaborator Author

@Gamenot Gamenot Dec 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning of EXPERIMENT_REQUIRED is that it will re-raise the error it had if it cannot (or will not recover) from the error.

@@ -203,12 +201,12 @@ def step(self, agent_actions):
observations[agent_id] = agent_spec.observation_adapter(observation)
infos[agent_id] = agent_spec.info_adapter(observation, reward, info)

for done in agent_dones.values():
for done in dones.values():
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am placing "__sim__" in dones I am unsure how to translate that through the gym interface. I can just remove it since all the agents are set done, I wonder if it should be signalled through infos as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think setting all agents to done should be sufficient for gym.

@Gamenot Gamenot requested review from sah-huawei and removed request for sah-huawei December 31, 2021 02:11
Comment on lines +329 to +342
tries = 2
first_exception = None
for _ in range(tries):
try:
self._resetting = True
return self._reset(scenario)
except Exception as e:
if not first_exception:
first_exception = e
finally:
self._resetting = False
self._log.error(f"Failed to successfully reset after {tries} times.")
raise first_exception
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am somewhat iffy about this change intended to give a slight leeway in case an edge case occurs that breaks reset. Essentially, because we do not return dones or infos from this reset it is not easily possible to communicate to the user that reset failed for some reason.

There are two directions we can take with this:

  • The reset has the responsibility to give a working simulation state. (currently going with this)
  • The user has to determine if reset worked.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right to go with the first one.

@Gamenot Gamenot requested a review from RutvikGupta December 31, 2021 18:26
Comment on lines +329 to +331
tries = 2
first_exception = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whats the reason behind 2 tries?

Copy link
Collaborator Author

@Gamenot Gamenot Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Situational problems occur within the reset sometimes, I considered more than 1 retry but then there is no particular number that will work if the engine is failing. My reasoning here is that reset should not fail for an edge case on reset but I would say reset failure twice in a row indicates it is not an edge case.

@Gamenot Gamenot force-pushed the bugfix-gracefully_handle_traci branch from ff5f291 to 35b0b92 Compare January 6, 2022 20:38
@Gamenot Gamenot merged commit 4c9935e into develop Jan 6, 2022
@Gamenot Gamenot deleted the bugfix-gracefully_handle_traci branch April 19, 2022 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants