-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Graceful handle of TraCI connection errors #1138
Conversation
What is the desired course of action when TraCI error happens (e.g., terminate, restart connection)? |
@Adaickalavan From conversation, the intention would be to gracefully end the episode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be now rebased on develop
since its an important change.
ee706bd
to
76fe450
Compare
Odd that the notebook test failed since it worked locally. I know the github workers seem to be having issues. |
if sim.should_reset: | ||
dones = {agent_id: True for agent_id in self.agent_ids} | ||
dones["__sim__"] = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here is that we could treat the sim as also having the possibility to be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can come through the dones
from SMARTS
; however, at the gym level it would need to be translated.
if len(route) > 0: | ||
goal = PositionalGoal.from_road(route[-1], sim.scenario.road_map) | ||
else: | ||
goal = EndlessGoal() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this was a bug with a 0 length route being considered a positional goal when there is no end edge.
recovery_flags=ProviderRecoveryFlags.EPISODE_REQUIRED | ||
| ProviderRecoveryFlags.ATTEMPT_RECOVERY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible to choose if the provider will attempt to recover from an error by specifying flags when adding the provider to the simulation.
smarts/core/smarts.py
Outdated
def add_provider( | ||
self, | ||
provider: Provider, | ||
recovery_flags: ProviderRecoveryFlags = ProviderRecoveryFlags.EXPERIMENT_REQUIRED, | ||
): | ||
assert isinstance(provider, Provider) | ||
self._providers.append(provider) | ||
self._provider_recovery_flags[provider] = recovery_flags |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simulation keeps track of if the provider should attempt recovery.
recovery_flags = self._provider_recovery_flags.get( | ||
provider, ProviderRecoveryFlags.EXPERIMENT_REQUIRED | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The provider will default to being required in the simulation.
smarts/core/smarts.py
Outdated
elif recovery_flags & ProviderRecoveryFlags.EXPERIMENT_REQUIRED: | ||
raise provider_error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meaning of EXPERIMENT_REQUIRED
is that it will re-raise the error it had if it cannot (or will not recover) from the error.
@@ -203,12 +201,12 @@ def step(self, agent_actions): | |||
observations[agent_id] = agent_spec.observation_adapter(observation) | |||
infos[agent_id] = agent_spec.info_adapter(observation, reward, info) | |||
|
|||
for done in agent_dones.values(): | |||
for done in dones.values(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I am placing "__sim__"
in dones
I am unsure how to translate that through the gym interface. I can just remove it since all the agents are set done, I wonder if it should be signalled through infos
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think setting all agents to done should be sufficient for gym.
tries = 2 | ||
first_exception = None | ||
for _ in range(tries): | ||
try: | ||
self._resetting = True | ||
return self._reset(scenario) | ||
except Exception as e: | ||
if not first_exception: | ||
first_exception = e | ||
finally: | ||
self._resetting = False | ||
self._log.error(f"Failed to successfully reset after {tries} times.") | ||
raise first_exception |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am somewhat iffy about this change intended to give a slight leeway in case an edge case occurs that breaks reset
. Essentially, because we do not return dones
or infos
from this reset
it is not easily possible to communicate to the user that reset
failed for some reason.
There are two directions we can take with this:
- The
reset
has the responsibility to give a working simulation state. (currently going with this) - The user has to determine if
reset
worked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right to go with the first one.
tries = 2 | ||
first_exception = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whats the reason behind 2 tries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Situational problems occur within the reset sometimes, I considered more than 1 retry but then there is no particular number that will work if the engine is failing. My reasoning here is that reset should not fail for an edge case on reset
but I would say reset
failure twice in a row indicates it is not an edge case.
ff5f291
to
35b0b92
Compare
This is an attempt to gracefully handle TraCI connection errors until the root issue is fixed.
See(for example):
#158
#295
#803