How does `reset` work? #119

AdityaGudimella · 2022-05-15T23:40:46Z

The `reset` function doesn't seem to actually reset the environment, at least for the "Pong-v5" env. In the code below I step the env with a random policy. I first step it `n_init_steps` times, then reset it, then run 3 episodes to completion and record the return of each. Because the policy is random, I expect every episode return after the reset to be close to -21. Instead, as I increase `n_init_steps`, the first episode return after the reset moves further from -21 (toward 0). If I don't reset the env after the initial steps, the returns for the 3 episodes are close to -21 as expected. The same holds for the returns computed as if the env had not been reset (the second list in each output). Am I doing something wrong, or is this a bug?

import envpool
import numpy as np


def test_envpool_resets_correctly() -> None:
    def gather_rewards(n_init_steps: int, reset: bool = True):
        env = envpool.make_gym("Pong-v5", num_envs=1, seed=0)
        ep_returns: list[float] = []
        def policy():
            return np.asarray([env.action_space.sample()])
        curr_ep_return = 0
        for _ in range(n_init_steps):
            _, rewards, dones, _ = env.step(policy())
            curr_ep_return += rewards.item()
            if dones.item():
                ep_returns.append(curr_ep_return)
                curr_ep_return = 0
        old_ep_return = curr_ep_return
        if reset:
            env.reset()
            curr_ep_return = 0
        # Copy ep_returns to avoid changing the original list
        # old_ep_returns assumes env was not reset
        old_ep_returns = [x for x in ep_returns]
        for _ in range(3):
            while True:
                _, rewards, dones, _ = env.step(policy())
                curr_ep_return += rewards.item()
                old_ep_return += rewards.item()
                if dones.item():
                    old_ep_returns.append(old_ep_return)
                    ep_returns.append(curr_ep_return)
                    old_ep_return = 0
                    curr_ep_return = 0
                    break
        return ep_returns, old_ep_returns

    print(gather_rewards(0))  # ([-21.0, -21.0, -19.0], [-21.0, -21.0, -19.0])
    print(gather_rewards(500))  # ([-10.0, -21.0, -21.0], [-21.0, -21.0, -21.0])
    print(gather_rewards(800))  # ([-2.0, -21.0, -21.0], [-20.0, -21.0, -21.0])
    print(gather_rewards(500, False))  # ([-20.0, -20.0, -21.0], [-20.0, -20.0, -21.0])
    print(gather_rewards(800, False))  # ([-21.0, -20.0, -20.0, -21.0], [-21.0, -20.0, -20.0, -21.0])
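
To check whether the bookkeeping above could produce these numbers on its own, here is a minimal sketch using a hypothetical toy env (`ToyEnv` is made up for illustration and is not part of EnvPool). It assumes a fixed -1 reward every 10th step, an episode that ends at -21, and an automatic reset when an episode ends. With a working `reset`, the first return after the reset is a full -21. With a no-op `reset`, only the remainder of the interrupted episode is counted, which looks like the pattern in the outputs above. This is only consistent with `reset` not clearing the game state; it does not prove that is what happens in EnvPool.

```python
class ToyEnv:
    """Hypothetical stand-in for Pong: -1 reward every 10th step, episode ends at -21."""

    def __init__(self, reset_works: bool) -> None:
        self.reset_works = reset_works
        self.steps = 0
        self.score = 0.0

    def reset(self) -> None:
        # A working reset clears all episode state; a broken one is a no-op.
        if self.reset_works:
            self.steps = 0
            self.score = 0.0

    def step(self) -> tuple:
        self.steps += 1
        reward = -1.0 if self.steps % 10 == 0 else 0.0
        self.score += reward
        done = self.score <= -21.0
        if done:
            # Assumed auto-reset at the end of an episode.
            self.steps = 0
            self.score = 0.0
        return reward, done


def first_return_after_reset(n_init_steps: int, reset_works: bool) -> float:
    """Step n_init_steps times, call reset, then return the next episode's return."""
    env = ToyEnv(reset_works)
    for _ in range(n_init_steps):
        env.step()
    env.reset()
    ep_return = 0.0
    while True:
        reward, done = env.step()
        ep_return += reward
        if done:
            return ep_return


print(first_return_after_reset(100, reset_works=True))   # -21.0
print(first_return_after_reset(100, reset_works=False))  # -11.0
```

In the toy case, 100 initial steps lose 10 points, so a no-op `reset` leaves only 11 points to lose before the episode ends.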