Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graceful handle of TraCI connection errors #1138

Merged
merged 29 commits into from
Jan 6, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
457f48b
Graceful handle of traci connection errors
Gamenot Nov 25, 2021
6751f7d
Make format
Gamenot Nov 26, 2021
066d519
Fix check for empty route
Gamenot Nov 26, 2021
18cc074
Fix bugs with provider changes
Gamenot Nov 26, 2021
ed5f65d
Add required property to provider
Gamenot Nov 26, 2021
79f0597
Add provider error handling
Gamenot Nov 27, 2021
0d15ccf
Add `__sim__` done
Gamenot Nov 27, 2021
60ebc4d
Format
Gamenot Nov 27, 2021
dcc6812
Update changelog
Gamenot Nov 29, 2021
3ab1f1e
Fix missing changes
Gamenot Nov 29, 2021
92fbea7
Add unsaved file
Gamenot Nov 29, 2021
203c857
Push another unsaved file.
Gamenot Nov 29, 2021
a0d9978
Document new methods
Gamenot Dec 1, 2021
57175d4
Move recovery flag configuration to SMARTS
Gamenot Dec 1, 2021
6d951d2
Apply suggestions
Gamenot Dec 2, 2021
1c519f8
Fix provider not inheriting from Provider
Gamenot Dec 2, 2021
2fcc59f
Revert unimportant changes
Gamenot Dec 5, 2021
68dbe98
Note make test test_notebook timeout in CHANGELOG
Gamenot Dec 5, 2021
482baec
Make sure notebook tests do not time out
Gamenot Dec 5, 2021
ac4aeaa
Update changelog
Gamenot Dec 5, 2021
d39aa38
Remove unnecessary EmptyProvider
Gamenot Dec 30, 2021
5a8ae79
Improve provider error handling
Gamenot Dec 30, 2021
929a0f9
Ensure that error handling is working
Gamenot Dec 30, 2021
c66c9a8
Fix issue that causes crash with TrapManager
Gamenot Dec 31, 2021
a3ec597
Fix `SumoTrafficSimulation.recover` definition
Gamenot Dec 31, 2021
2ff5ef3
Add missing import
Gamenot Dec 31, 2021
4330279
Handle `SMARTS.reset(..)` errors
Gamenot Dec 31, 2021
fabe209
Set default recover to re-raise exception
Gamenot Dec 31, 2021
35b0b92
Address comments.
Gamenot Jan 6, 2022
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/workflows/ci-base-tests-linux.yml
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ jobs:
tests:
- ./envision
- ./smarts/contrib
- ./smarts/core
- ./smarts/core --nb-exec-timeout 65536
- ./smarts/env --ignore=./smarts/env/tests/test_rllib_hiway_env.py
- ./smarts/env/tests/test_rllib_hiway_env.py
- ./smarts/sstudio
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/ci-base-tests-mac.yml
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ jobs:
- ./examples/tests --ignore=./examples/tests/test_learning.py
- ./smarts/sstudio
- ./smarts/env/tests/test_rllib_hiway_env.py
- ./smarts/core
- ./smarts/core --nb-exec-timeout 65536
- ./smarts/env --ignore=./smarts/env/tests/test_rllib_hiway_env.py
steps:
- name: Checkout
Expand Down
16 changes: 14 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,17 @@ Copy and pasting the git commit messages is __NOT__ enough.
- Added `single_agent` env wrapper and unit test. The wrapper converts a single-agent SMARTS environment's step and reset output to be compliant with gym spaces.
- Added `rgb_image` env wrapper and unit test. The wrapper filters SMARTS environment observation and returns only top-down RGB image as observation.
- Added a "ReplayAgent" wrapper to allow users to rerun an agent previously run by saving its configurations and inputs. See Issue #971.
- Added `smarts.core.provider.ProviderRecoveryFlags` as flags to determine how `SMARTS` should handle failures in providers. They are as follows:
- `NOT_REQUIRED`: Not needed for the current step. Error causes skip of provider if it should recover but cannot or should not recover.
- `EPISODE_REQUIRED`: Needed for the current episode. Results in episode ending if it should recover but cannot or should not recover.
- `EXPERIMENT_REQUIRED`: Needed for the experiment. Results in exception if it should recover but cannot or should not recover.
- `ATTEMPT_RECOVERY`: Provider should attempt to recover from the exception or disconnection.
- Added recovery options for providers in `smarts.core.provider.Provider`. These include:
- Add `recover()` method to providers to attempt to recover from errors and disconnection.
- Add `connected` property to providers to check if the provider is still connected.
- Added recovery options to `smarts.core.smarts.SMARTS.add_provider()`
- Add `recovery_flags` argument to configure the recovery options if the provider disconnects or throws an exception.

### Changed
- `test-requirements` github action job renamed to `check-requirements-change` and only checks for requirements changes without failing.
- Moved examples tests to `examples` and used relative imports to fix a module collision with `aiohttp`'s `examples` module.
Expand All @@ -44,8 +55,8 @@ Copy and pasting the git commit messages is __NOT__ enough.
- Raised a warning message for building scenarios without `map.net.xml` file. See PR #1161.
### Fixed
- Fix lane vector for the unique cases of lane offset >= lane's length. See PR #1173.
- Logic fixes to the `_snap_internal_holes` and `_snap_external_holes` methods in `smarts.core.sumo_road_network.py` for crude geometry holes of sumo road map. Re-adjusted the entry position of vehicles in `smarts.sstudio.genhistories.py` to avoid false positive events. See PR #992.
- Prevent `test_notebook.ipynb` cells from timing out by increasing time to unlimited using `/metadata/execution/timeout=-1` within the notebook for regular uses, and `pytest` call with `--nb-exec-timeout -1` option for tests. See for more details: "https://jupyterbook.org/content/execute.html#setting-execution-timeout" and "https://pytest-notebook.readthedocs.io/en/latest/user_guide/tutorial_intro.html#pytest-fixture".
- Logic fixes to the `_snap_internal_holes` and `_snap_external_holes` methods in `smarts.core.sumo_road_network.py` for crude geometry holes of sumo road map. Re-adjusted the entry position of vehicles in `smarts.sstudio.genhistories.py` to avoid false positive events. See PR #992.
- Prevent `test_notebook.ipynb` cells from timing out by increasing time to unlimited using `/metadata/execution/timeout=65536` within the notebook for regular uses, and `pytest` call with `--nb-exec-timeout 65536` option for tests. See for more details: "https://jupyterbook.org/content/execute.html#setting-execution-timeout" and "https://pytest-notebook.readthedocs.io/en/latest/user_guide/tutorial_intro.html#pytest-fixture".
- Stop `multiprocessing.queues.Queue` from throwing an error by importing `multiprocessing.queues` in `envision/utils/multiprocessing_queue.py`.
- Prevent vehicle insertion on top of ignored social vehicles when the `TrapManager` defaults to emitting a vehicle for the ego to control. See PR #1043
- Prevent `TrapManager`from trapping vehicles in Bubble airlocks. See Issue #1064.
Expand All @@ -54,6 +65,7 @@ Copy and pasting the git commit messages is __NOT__ enough.
- Updated deprecated Shapely functionality.
- Fixed the type of `position` (pose) fields emitted to envision to match the existing type hints of `tuple`.
- Properly detect whether waypoint is present in mission route, while computing distance travelled by agents with missions in TripMeterSensor.
- Fixed `test_notebook` timeout by setting `pytest --nb-exec-timeout 65536`.
### Deprecated
- The `timestep_sec` property of SMARTS is being deprecated in favor of `fixed_timesep_sec`
for clarity since we are adding the ability to have variable time steps.
Expand Down
4 changes: 2 additions & 2 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -8,8 +8,8 @@ test: build-all-scenarios
--forked \
--dist=loadscope \
-n `nproc --ignore 2` \
--nb-exec-timeout -1 \
./envision ./smarts/contrib ./smarts/core ./smarts/env ./smarts/sstudio ./tests ./examples/tests \
--nb-exec-timeout 65536 \
./examples/tests ./smarts/env ./envision ./smarts/contrib ./smarts/core ./smarts/sstudio ./tests \
--ignore=./smarts/core/tests/test_smarts_memory_growth.py \
--ignore=./smarts/core/tests/test_env_frame_rate.py \
--ignore=./smarts/env/tests/test_benchmark.py \
Expand Down
7 changes: 6 additions & 1 deletion smarts/contrib/pymarl/pymarl_hiway_env.py
Original file line number Diff line number Diff line change
Expand Up @@ -98,6 +98,7 @@ def __init__(self, config):
self._observations = None
self._state = None
self._steps = 0
self._dones_registered = 0

seed = self._config.get("seed", 42)
smarts.core.seed(seed)
Expand Down Expand Up @@ -193,6 +194,10 @@ def step(self, agent_actions):
infos["rewards_list"] = rewards

self._steps += 1
for done in dones.values():
self._dones_registered += 1 if done else 0

dones["__all__"] = self._dones_registered >= self.n_agents
infos["dones_list"] = np.array(list(dones.values()))
dones = infos["dones_list"]
if self._steps >= self.episode_limit:
Expand All @@ -203,7 +208,7 @@ def step(self, agent_actions):

def reset(self):
self._steps = 0

self._dones_registered = 0
scenario = next(self._scenarios_iterator)
observations = self._smarts.reset(scenario)
self._observations = np.asarray(
Expand Down
5 changes: 5 additions & 0 deletions smarts/core/agent_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -213,6 +213,10 @@ def observe(self, sim):
rewards[agent_id] = vehicle.trip_meter_sensor(increment=True)
scores[agent_id] = vehicle.trip_meter_sensor()

if sim.should_reset:
dones = {agent_id: True for agent_id in self.agent_ids}
dones["__sim__"] = True
Comment on lines +216 to +218
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is that we could treat the sim as also having the possibility to be done.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can come through the dones from SMARTS; however, at the gym level it would need to be translated.


return observations, rewards, scores, dones

def _vehicle_reward(self, vehicle_id, sim):
Expand Down Expand Up @@ -504,6 +508,7 @@ def _teardown_agents_by_ids(self, agent_ids, filter_ids: Set):
for agent_id in ids_:
self._agent_interfaces.pop(agent_id, None)

self._pending_agent_ids = self._pending_agent_ids - ids_
return ids_

def reset_agents(self, observations):
Expand Down
11 changes: 6 additions & 5 deletions smarts/core/bubble_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@
from shapely.geometry import CAP_STYLE, JOIN_STYLE, Point, Polygon

from smarts.core.data_model import SocialAgent
from smarts.core.plan import Mission, Plan, PositionalGoal, Start
from smarts.core.plan import EndlessGoal, Mission, Plan, PositionalGoal, Start
from smarts.core.road_map import RoadMap
from smarts.core.utils.id import SocialAgentId
from smarts.core.utils.string import truncate
Expand Down Expand Up @@ -550,10 +550,11 @@ def _prepare_sensors_for_agent_control(
# Setup mission (also used for observations)
# XXX: this is not quite right. route may not be what the agent wants to take.
route = sim.traffic_sim.vehicle_route(vehicle_id=vehicle.id)
mission = Mission(
start=Start(vehicle.position[:2], vehicle.heading),
goal=PositionalGoal.from_road(route[-1], sim.scenario.road_map),
)
if len(route) > 0:
goal = PositionalGoal.from_road(route[-1], sim.scenario.road_map)
else:
goal = EndlessGoal()
Comment on lines +553 to +556
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was a bug with a 0 length route being considered a positional goal when there is no end edge.

mission = Mission(start=Start(vehicle.position[:2], vehicle.heading), goal=goal)
plan.create_route(mission)

def _start_social_agent(
Expand Down
41 changes: 40 additions & 1 deletion smarts/core/provider.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,13 +18,27 @@
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
from dataclasses import dataclass, field
from typing import List, Optional, Set
from enum import IntFlag
from typing import List, Optional, Set, Tuple

from .controllers import ActionSpaceType
from .scenario import Scenario
from .vehicle import VehicleState


class ProviderRecoveryFlags(IntFlag):
"""This describes actions to be taken with a provider should it fail."""

NOT_REQUIRED = 0x00000000
"""Not needed for the current step. Error causes skip."""
EPISODE_REQUIRED = 0x00000010
"""Needed for the current episode. Results in episode ending."""
EXPERIMENT_REQUIRED = 0x00000100
"""Needed for the experiment. Results in exception if an error is thrown."""
ATTEMPT_RECOVERY = 0x00001000
"""Provider should attempt to recover from the exception or disconnection."""
Gamenot marked this conversation as resolved.
Show resolved Hide resolved


@dataclass
class ProviderState:
vehicles: List[VehicleState] = field(default_factory=list)
Expand Down Expand Up @@ -76,3 +90,28 @@ def reset(self):

def teardown(self):
raise NotImplementedError

def recover(
self, scenario, elapsed_sim_time: float, error: Optional[Exception] = None
) -> Tuple[ProviderState, bool]:
"""Attempt to reconnect the provider if an error or disconnection occured.
Implementations may choose to e-raise the passed in exception.
Args:
scenario (Scenario): The scenario of the current episode.
elapsed_sim_time (float): The current elapsed simulation time.
error (Optional[Exception]): An exception if an exception was thrown.
Returns:
bool: The success/failure of the attempt to reconnect.
"""
if error:
raise error
return ProviderState(), False

@property
def connected(self) -> bool:
"""Determine if the provider is still responsive. (e.g. the case that the provider is
sending provider state over the internet and has stopped responding)
Returns:
bool: The connection state of the provider.
"""
return True
Loading