[bug] EOF error with gym.vector.AsyncVectorEnv() when calling the step method. #3281

jng164 · 2024-07-08T12:44:56Z

Describe the bug
After 12M steps of training, calling the step method suddenly raises an EOFError.

Code example
I am using gym.vector.AsyncVectorEnv(). I use the function make_env to create my environments.

def make_env(gym_id, seed, idx, capture_video, run_name, qubits, depth):
    
    def thunk():
        env = gym.make(gym_id, qubits=qubits, depth=depth, env_id=idx)
        env = gym.wrappers.RecordEpisodeStatistics(env)
        if capture_video and idx == 0:
            env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
        return env

    return thunk

The main part of the code is as follows:

if __name__ == "__main__":
    mp.set_start_method('spawn')
    device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")
    envs = gym.vector.AsyncVectorEnv(
        [make_env(args.gym_id, args.seed + i, i, args.capture_video, run_name, qubits, depth) for i in range(args.num_envs)],
        shared_memory=False,
    )
    agent = AgentGNN(envs, device).to(device)  # Graph Neural Network
    for update in range(1, num_updates + 1):
        for step in range(args.num_steps):  
            global_step += 1 * args.num_envs
            dones[step] = next_done
            try:
                with torch.no_grad():
                    action, logprob, _, value, logits, action_ids = agent.get_action_and_value(next_obs_graph, device=device)
                    values[step] = value.flatten()
                actions[step] = action
                logprobs[step] = logprob
                
                next_obs, reward, done, deprecated, info = envs.step(action_ids.cpu().numpy()) 
            except TypeError as e:
                print(f"Error: {e}")
            rewards[step] = torch.tensor(reward).to(device).view(-1)

            next_done = torch.Tensor(done).to(device)

As far as I understand the error, this code creates as many threads as there are environments, and in one particular thread the agent breaks in env.step(). As you can see, I tried to work around this with a try-except, but it does not work. I think this may be because the thread just hangs until it breaks, but I am not sure.
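A minimal sketch of why the try-except in the snippet cannot help, using only the standard library. The traceback in this report ends in multiprocessing/connection.py, so the workers talk to the main process over multiprocessing pipes. Here the worker is not a real environment: closing one end of a Pipe stands in for a worker that has gone away, and the function name is hypothetical. It shows that recv() on the surviving end raises EOFError, which an `except TypeError` handler does not catch.

```python
from multiprocessing import Pipe


def recv_after_peer_closed():
    """Return the name of the exception raised by recv() once the peer end is closed."""
    parent_conn, child_conn = Pipe()
    child_conn.close()  # stand-in for a worker that is no longer there
    try:
        try:
            parent_conn.recv()
        except TypeError:
            # Same exception type as the handler in the report.
            # EOFError is not a TypeError, so this branch is never taken.
            return "TypeError"
    except EOFError:
        # The error escapes the inner handler and is only caught here.
        return "EOFError"
    finally:
        parent_conn.close()
    return "no error"


print(recv_after_peer_closed())  # prints "EOFError"
```

Catching EOFError instead would only stop the crash from propagating. It would not bring the worker back, so the underlying reason the other end of the pipe closed would still need to be found.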

Traceback

Traceback (most recent call last):
  File "/home/jriu/Copt-cquere/rl-zx/ppo.py", line 204, in <module>
    next_obs, reward, done, deprecated, info = envs.step(action_ids.cpu().numpy())
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/vector_env.py", line 137, in step
    return self.step_wait()
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 320, in step_wait
    result, success = pipe.recv()
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py:457: UserWarning: WARN: Calling `close` while waiting for a pending call to `step` to complete.
Exception ignored in: <function AsyncVectorEnv.__del__ at 0x7ea18eb856c0>
Traceback (most recent call last):
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 546, in __del__
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/vector_env.py", line 205, in close
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 461, in close_extras
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 320, in step_wait
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 250, in recv
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
EOFError:
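One way to narrow this down might be to look at how the worker processes ended once the EOFError appears. The sketch below is an assumption-laden helper, not part of gym: `describe_exit` and `report_workers` are hypothetical names, and it assumes the vector env keeps its workers as multiprocessing.Process objects in an attribute such as `envs.processes`, which should be checked against the installed gym version. It relies only on the documented meaning of Process.exitcode, where a negative value -N means the process was terminated by signal N.

```python
import signal
from types import SimpleNamespace


def describe_exit(exitcode):
    """Turn a multiprocessing.Process.exitcode into a readable string."""
    if exitcode is None:
        return "still running"
    if exitcode == 0:
        return "exited normally"
    if exitcode < 0:
        # A negative exit code -N means the process was terminated by signal N.
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            return f"killed by signal {-exitcode}"
    return f"exited with code {exitcode}"


def report_workers(processes):
    """Map worker index -> exit description for objects exposing `.exitcode`."""
    return {i: describe_exit(p.exitcode) for i, p in enumerate(processes)}


# Stand-in objects so the sketch runs without gym. With a real vector env one
# might pass `envs.processes` instead (assumed attribute, see the note above).
fake_workers = [
    SimpleNamespace(exitcode=None),
    SimpleNamespace(exitcode=-signal.SIGTERM),
]
print(report_workers(fake_workers))
```

A worker reported as killed by a signal would point to something outside the Python code ending the process, while a positive exit code would point to the worker itself failing. Nothing in this report confirms either case.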

System Info
I use gym 0.26.2, torch 2.0.1, and Python 3.10.14 on Ubuntu 24.04 LTS. All of the packages were installed using pip.

Checklist

I have checked that there is no similar issue in the repo (required)

Fengwenhao01 · 2024-07-24T05:12:27Z

I have the same problem.

antoniopioricciardi · 2025-01-13T21:31:11Z

Same, with gym 0.29.1, torch 2.0.1, python 3.9.18 and Pop!_OS (an Ubuntu distro) 22.04.

No issue running with SyncVectorEnv, or running Async on Mac. My python environment is installed via uv pip.

w1463442883 · 2025-01-13T21:32:51Z

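This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email personally. I will reply as soon as possible after my vacation ends.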
