Graceful handle of TraCI connection errors #1138

Gamenot · 2021-11-25T21:57:44Z

This is an attempt to gracefully handle TraCI connection errors until the root issue is fixed.

See (for example):
#158
#295
#803

smarts/core/bubble_manager.py

Adaickalavan · 2021-11-26T03:38:35Z

What is the desired course of action when a TraCI error happens (e.g., terminate, restart connection)?

Gamenot · 2021-11-26T19:29:28Z

> What is the desired course of action when a TraCI error happens (e.g., terminate, restart connection)?

@Adaickalavan From conversation, the intention would be to gracefully end the episode.

smarts/core/sumo_traffic_simulation.py

RutvikGupta

I think this should now be rebased on develop since it's an important change.

smarts/core/provider.py

Makefile

smarts/core/provider.py

smarts/core/sumo_traffic_simulation.py

smarts/core/provider.py

Gamenot · 2021-12-11T02:56:48Z

Odd that the notebook test failed, since it worked locally. I know the GitHub workers seem to be having issues.

Gamenot · 2021-12-30T18:32:56Z

smarts/core/agent_manager.py

+        if sim.should_reset:
+            dones = {agent_id: True for agent_id in self.agent_ids}
+            dones["__sim__"] = True


The idea here is that the sim itself can also be treated as done.

This can come through the dones from SMARTS; at the gym level, however, it would need to be translated.

Gamenot · 2021-12-30T19:49:04Z

smarts/core/bubble_manager.py

+        if len(route) > 0:
+            goal = PositionalGoal.from_road(route[-1], sim.scenario.road_map)
+        else:
+            goal = EndlessGoal()


I believe this was a bug: a zero-length route was treated as having a positional goal even though there is no end edge.
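A minimal sketch of the guard in the diff above, written so it runs on its own. `PositionalGoal` and `EndlessGoal` here are simplified placeholders rather than the SMARTS classes, and `pick_goal` is a hypothetical helper name:

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class EndlessGoal:
    """Placeholder: a mission with no destination."""


@dataclass(frozen=True)
class PositionalGoal:
    """Placeholder: a mission that ends on a given road."""

    end_road: str


def pick_goal(route: Sequence[str]) -> Union[PositionalGoal, EndlessGoal]:
    # Only build a positional goal when the route has a last road to aim at.
    if len(route) > 0:
        return PositionalGoal(end_road=route[-1])
    # An empty route has no end edge, so fall back to an endless goal.
    return EndlessGoal()
```

With these placeholders, `pick_goal([])` yields an `EndlessGoal`, while `pick_goal(["a", "b"])` targets road `"b"`.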

Gamenot · 2021-12-30T19:50:46Z

smarts/core/smarts.py

+                recovery_flags=ProviderRecoveryFlags.EPISODE_REQUIRED
+                | ProviderRecoveryFlags.ATTEMPT_RECOVERY,


Whether a provider attempts to recover from an error can be chosen by passing flags when the provider is added to the simulation.
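As a rough illustration of how such flags combine, here is a stand-in built on Python's `enum.Flag`. The member names come from the diffs in this PR; the class name `RecoveryFlags` and the member values are assumptions and may not match the real `ProviderRecoveryFlags`:

```python
from enum import Flag, auto


class RecoveryFlags(Flag):
    """Stand-in for ProviderRecoveryFlags; member values are assumed."""

    EPISODE_REQUIRED = auto()
    EXPERIMENT_REQUIRED = auto()
    ATTEMPT_RECOVERY = auto()


# Combine flags with bitwise OR, as in the add_provider call above.
flags = RecoveryFlags.EPISODE_REQUIRED | RecoveryFlags.ATTEMPT_RECOVERY

# Test membership with bitwise AND; an empty result is falsy.
wants_recovery = bool(flags & RecoveryFlags.ATTEMPT_RECOVERY)
ends_experiment = bool(flags & RecoveryFlags.EXPERIMENT_REQUIRED)
```

Here `wants_recovery` is true and `ends_experiment` is false, because only the first two flags were combined.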

Gamenot · 2021-12-30T19:51:49Z

smarts/core/smarts.py

+    def add_provider(
+        self,
+        provider: Provider,
+        recovery_flags: ProviderRecoveryFlags = ProviderRecoveryFlags.EXPERIMENT_REQUIRED,
+    ):
        assert isinstance(provider, Provider)
        self._providers.append(provider)
+        self._provider_recovery_flags[provider] = recovery_flags


The simulation keeps track of whether each provider should attempt recovery.

Gamenot · 2021-12-30T19:56:13Z

smarts/core/smarts.py

+        recovery_flags = self._provider_recovery_flags.get(
+            provider, ProviderRecoveryFlags.EXPERIMENT_REQUIRED
+        )


A provider defaults to EXPERIMENT_REQUIRED, i.e. it is treated as required by the simulation.

Gamenot · 2021-12-30T19:57:38Z

smarts/core/smarts.py

+            elif recovery_flags & ProviderRecoveryFlags.EXPERIMENT_REQUIRED:
+                raise provider_error


EXPERIMENT_REQUIRED means that the provider's error is re-raised if the provider cannot (or will not) recover from it.
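One way the flags could drive error handling is sketched below. Only the `elif ... EXPERIMENT_REQUIRED: raise provider_error` branch is taken from the diff. The branch order, the string return values, the `handle_provider_error` name, and the `try_recover` callable are assumptions made for illustration:

```python
from enum import Flag, auto


class RecoveryFlags(Flag):
    """Stand-in for ProviderRecoveryFlags; member values are assumed."""

    EPISODE_REQUIRED = auto()
    EXPERIMENT_REQUIRED = auto()
    ATTEMPT_RECOVERY = auto()


def handle_provider_error(flags, provider_error, try_recover):
    """Decide what to do after a provider raised provider_error.

    try_recover is a hypothetical callable that returns True on success.
    """
    if flags & RecoveryFlags.ATTEMPT_RECOVERY and try_recover():
        return "recovered"
    if flags & RecoveryFlags.EPISODE_REQUIRED:
        # Assumed behaviour: end the episode gracefully instead of crashing.
        return "reset_episode"
    elif flags & RecoveryFlags.EXPERIMENT_REQUIRED:
        # The experiment cannot continue without this provider.
        raise provider_error
    # Assumed fallback for a provider with no "required" flag set.
    return "ignored"
```

Under these assumptions, a provider registered with `EPISODE_REQUIRED | ATTEMPT_RECOVERY` first tries to recover and otherwise ends the episode, while one left at the default `EXPERIMENT_REQUIRED` propagates the original exception.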

Gamenot · 2021-12-31T02:10:34Z

smarts/env/hiway_env.py

@@ -203,12 +201,12 @@ def step(self, agent_actions):
            observations[agent_id] = agent_spec.observation_adapter(observation)
            infos[agent_id] = agent_spec.info_adapter(observation, reward, info)

-        for done in agent_dones.values():
+        for done in dones.values():


If I am placing "__sim__" in dones, I am unsure how to translate that through the gym interface. I can just remove it, since all the agents are set done. I wonder if it should be signalled through infos as well.

sah-huawei · 2022-01-01

Yeah, I think setting all agents to done should be sufficient for gym.
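A small sketch of what that translation might look like at the environment boundary. `split_sim_done` is a hypothetical helper, not part of SMARTS or gym. It strips the simulation-level key so only per-agent entries reach the gym user, and keeps the flag available in case it should be reported some other way:

```python
from typing import Dict, Tuple

SIM_KEY = "__sim__"


def split_sim_done(dones: Dict[str, bool]) -> Tuple[Dict[str, bool], bool]:
    """Separate the simulation-level flag from the per-agent dones."""
    agent_dones = {k: v for k, v in dones.items() if k != SIM_KEY}
    sim_done = bool(dones.get(SIM_KEY, False))
    return agent_dones, sim_done


dones = {"agent_0": True, "agent_1": True, "__sim__": True}
agent_dones, sim_done = split_sim_done(dones)
```

After the call, `agent_dones` contains only the two agent entries and `sim_done` carries the simulation flag separately.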

Gamenot · 2021-12-31T02:25:15Z

smarts/core/smarts.py

+        tries = 2
+        first_exception = None
+        for _ in range(tries):
+            try:
+                self._resetting = True
+                return self._reset(scenario)
+            except Exception as e:
+                if not first_exception:
+                    first_exception = e
+            finally:
+                self._resetting = False
+        self._log.error(f"Failed to successfully reset after {tries} times.")
+        raise first_exception


I am somewhat iffy about this change, which is intended to give slight leeway in case an edge case breaks reset. Because reset does not return dones or infos, there is no easy way to communicate to the user that it failed.

There are two directions we can take with this:

1. The reset has the responsibility to give a working simulation state (currently going with this).
2. The user has to determine if reset worked.

sah-huawei · 2022-01-01

I think you're right to go with the first one.

RutvikGupta · 2021-12-31T18:39:52Z

smarts/core/smarts.py

+        tries = 2
+        first_exception = None


What's the reason behind 2 tries?

Gamenot · 2022-01-04

Situational problems sometimes occur within the reset. I considered more than one retry, but no particular number will work if the engine is failing. My reasoning is that reset should not fail because of an edge case, but two failures in a row indicate that it is not an edge case.
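The retry behaviour can be tried in isolation with the sketch below. `reset_with_retries` mirrors the loop in the diff (without the `_resetting` bookkeeping and logging), and `FlakyReset` is a hypothetical stand-in for a reset that fails a fixed number of times:

```python
def reset_with_retries(reset_fn, tries=2):
    """Call reset_fn up to `tries` times; re-raise the first error if all fail."""
    first_exception = None
    for _ in range(tries):
        try:
            return reset_fn()
        except Exception as e:
            # Keep the earliest failure, since it is usually the root cause.
            if first_exception is None:
                first_exception = e
    raise first_exception


class FlakyReset:
    """Hypothetical reset that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError(f"reset failed on call {self.calls}")
        return "observations"
```

A single failure (`FlakyReset(1)`) is absorbed by the second attempt, whereas two failures in a row (`FlakyReset(2)`) surface the first exception to the caller.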

Gamenot requested review from RutvikGupta and sah-huawei November 25, 2021 21:57

Gamenot changed the title ~~Graceful handle of traci connection errors~~ Graceful handle of TraCI connection errors Nov 25, 2021

RutvikGupta reviewed Nov 25, 2021

smarts/core/bubble_manager.py (outdated)

RutvikGupta requested a review from Adaickalavan November 25, 2021 22:30

RutvikGupta reviewed Nov 29, 2021

smarts/core/sumo_traffic_simulation.py (outdated)

RutvikGupta reviewed Nov 29, 2021

smarts/core/sumo_traffic_simulation.py

RutvikGupta reviewed Nov 29, 2021

smarts/core/sumo_traffic_simulation.py

RutvikGupta approved these changes Dec 1, 2021

RutvikGupta reviewed Dec 1, 2021

Gamenot changed the base branch from marl_benchmark_path_fixes to develop December 1, 2021 21:15

sah-huawei requested changes Dec 1, 2021

Gamenot force-pushed the bugfix-gracefully_handle_traci branch from ee706bd to 76fe450 Compare December 1, 2021 22:39

Gamenot requested a review from sah-huawei December 2, 2021 03:13

Gamenot added the release critical label Dec 21, 2021

Gamenot commented Dec 30, 2021

Gamenot commented Dec 31, 2021

Gamenot requested review from sah-huawei and removed request for sah-huawei December 31, 2021 02:11

Gamenot commented Dec 31, 2021

Gamenot requested a review from RutvikGupta December 31, 2021 18:26

RutvikGupta reviewed Dec 31, 2021

Gamenot added 26 commits January 6, 2022 15:16

18cc074 Fix bugs with provider changes
ed5f65d Add required property to provider
79f0597 Add provider error handling
0d15ccf Add __sim__ done
60ebc4d Format
dcc6812 Update changelog
3ab1f1e Fix missing changes
92fbea7 Add unsaved file
203c857 Push another unsaved file.
a0d9978 Document new methods
57175d4 Move recovery flag configuration to SMARTS
6d951d2 Apply suggestions
1c519f8 Fix provider not inheriting from Provider
2fcc59f Revert unimportant changes
68dbe98 Note make test test_notebook timeout in CHANGELOG
482baec Make sure notebook tests do not time out
ac4aeaa Update changelog
d39aa38 Remove unnecessary EmptyProvider
5a8ae79 Improve provider error handling
929a0f9 Ensure that error handling is working
c66c9a8 Fix issue that causes crash with TrapManager
a3ec597 Fix SumoTrafficSimulation.recover definition
2ff5ef3 Add missing import
4330279 Handle SMARTS.reset(..) errors
fabe209 Set default recover to re-raise exception
35b0b92 Address comments.

Gamenot force-pushed the bugfix-gracefully_handle_traci branch from ff5f291 to 35b0b92 Compare January 6, 2022 20:38

Gamenot merged commit 4c9935e into develop Jan 6, 2022

Gamenot mentioned this pull request Jan 6, 2022

Setting up MARL Benchmark with SMARTS 0.6.1 #1126

Closed

Gamenot deleted the bugfix-gracefully_handle_traci branch April 19, 2022 13:33
