This project uses Stable Baselines3 to implement Proximal Policy Optimization (PPO) for a spacecraft landing simulation. The graphics and simulation engine are implemented in C++, while model training and inference are in Python.
Install the dependencies listed in `requirements.txt`. To quickly compare PPO against proportional control, run `src/lander_py/train.py` to train the PPO model, then run `benchmark_agents.py` to plot the performance of the two methods.
I've abstracted away most of the C++ codebase using Pybind11 into a module, `lander_agent_cpp.so`, available in the `build/` folder. This module exposes the `Agent` class, which is the interface through which we interact with the C++ environment.
Unfortunately, I've not integrated the graphics engine with RL yet. The C++ codebase relies almost entirely on global variables and global functions, which makes encapsulation and abstraction incredibly difficult!
If you'd also like to run the interactive graphics engine, set `render=true` and `agent_flag=false` in `main.cpp`, then build the project with CMake using the instructions below.
The base reward function is sparse, only giving a meaningful reward at the end of an episode:

```python
if landed_safely:
    return 100.0
elif crashed:
    return -100.0
else:
    return -1.0
```
The proportional control algorithm did much better than PPO. Because of the sparse-reward problem, the agent never received a positive enough signal to learn to reduce its velocity below 0.5 m/s. Although steps were taken during training to encourage exploration (such as increasing the entropy coefficient, see Equation 9 of the original PPO paper, and decreasing the per-time-step penalty), the model was still unable to learn a good enough landing algorithm.
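As a rough illustration, raising the entropy bonus in Stable Baselines3 amounts to passing a larger `ent_coef` when constructing the PPO model. This is only a sketch, not the exact configuration in `train.py`; the coefficient value and the `LanderEnv` import path are assumptions.

```python
# Sketch: encouraging exploration by raising the entropy coefficient
# (the c2 term in Equation 9 of the PPO paper). Values are illustrative only.
from stable_baselines3 import PPO

from lander_env import LanderEnv  # assumed import path

env = LanderEnv()
model = PPO(
    "MlpPolicy",
    env,
    ent_coef=0.01,  # SB3 default is 0.0; larger values reward higher policy entropy
    verbose=1,
)
model.learn(total_timesteps=200_000)
```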
- Our observation (state) space consists of the velocity vector
- We only ever obtain rewards at the end of the episode, which makes it extremely difficult to get a signal to update the weights correctly. Using a simulation timestep $\delta t$ of
- We introduce a slight negative reward on every step to prevent the agent from being "lazy" and artificially prolonging the simulation
Perhaps the landing problem is too complex, and we should start with a simpler problem whose reward function is denser, providing rewards throughout the episode. I tried setting the kinetic energy as the reward function (with a negative sign, so that the agent minimizes it):
```python
kinetic_energy = 0.5 * np.sum(velocity_array**2)
return -kinetic_energy
```
The model successfully learned how to hover, albeit in an extremely stochastic way. The altitude remains more or less constant, with roughly zero velocity, until the fuel runs out and throttling no longer has any effect.
Inspired by the success of learning to hover, we next set the sum of the potential and kinetic energy as the reward function (again negated). After all, the minimum-energy state is exactly a successful landing: at the surface with zero velocity. Despite these nice theoretical properties, this reward function did not work in practice.

```python
potential_energy = -(self.GRAVITY_CONSTANT * self.MARS_MASS) / np.linalg.norm(
    position_array
)
kinetic_energy = 0.5 * np.sum(velocity_array**2)
return -(potential_energy + kinetic_energy)
```
After playing around with various forms of reward functions, the one that came closest to a successful landing had this form:
```python
kinetic_energy = 0.5 * np.sum(velocity_array**2)
return -(altitude**2 + 1.5 * kinetic_energy)
```
where the agent did learn to decrease its throttle when approaching the surface, but not quite well enough to bring its landing speed below the required 0.5 m/s.
.
├── CMakeLists.txt
├── README.md
├── requirements.txt
├── src
│ ├── assignment2.py
│ ├── lander_cpp
│ │ ├── agent.cpp
│ │ ├── agent_wrapper.cpp
│ │ ├── autopilot.cpp
│ │ ├── lander.cpp
│ │ ├── lander.h
│ │ ├── lander_graphics.cpp
│ │ ├── lander_mechanics.cpp
│ │ └── main.cpp
│ ├── lander_py
│ │ ├── benchmark_agents.py
│ │ ├── lander_env.py
│ │ ├── test_lander_agent_cpp.py
│ │ ├── test_lander_env.py
│ │ ├── train.py
│ │ └── models/
│ └── spring
│ ├── assignment1.py
│ ├── assignment3.cpp
│ ├── spring.cpp
│ ├── spring.py
│ └── visualize_cpp.py
└── utils.py
`src/lander_cpp/` contains the OpenGL graphics methods in `lander_graphics.cpp` and the core numerical simulation methods in `lander_mechanics.cpp`. `agent.cpp` contains our `Agent` interface with Python, while `agent_wrapper.cpp` uses Pybind11 to wrap our `Agent` class so that it can be called from Python. `autopilot.cpp` contains the C++ proportional controller implementation. `main.cpp` contains the main function, which can either display the graphics engine or run an agent without using the engine.
As basically all functions and variables are global, it's very difficult to encapsulate methods and integrate our `Agent` class seamlessly into the graphics engine. Major refactoring is needed to encapsulate `update_lander_state()`, `autopilot()`, `numerical_dynamics()`, and `update_visualization()`, either by taking our `Agent` class as a parameter or by accessing variables locally.
`src/lander_py/` contains the code to train and evaluate our agents. `test_lander_agent_cpp.py` tests whether pybind11 has successfully exposed all our methods, while `test_lander_env.py` tests whether our gymnasium environment works as intended. `models/` contains the zip files for all our model weights.
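For example, a saved policy can be reloaded from `models/` with Stable Baselines3; the filename below is a placeholder, so use whichever zip file your training run produced.

```python
# Sketch: reloading trained weights for evaluation. The filename is a placeholder.
from stable_baselines3 import PPO

model = PPO.load("src/lander_py/models/ppo_lander")  # loads ppo_lander.zip
```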
`spring/` contains some of the assignment code for simulating simple harmonic motion.
We use CMake to compile our lander project (but not `spring`). Make sure you have CMake, Pybind11, OpenGL, and GLUT installed for the C++ projects.
- Ensure `CMakeLists.txt` is in the root directory
- Create a build directory in the root: `mkdir build && cd build`
- Run CMake: `cmake ..`
- Build the project: `make`
After compiling, you can run `import build.lander_agent_cpp as lander_agent_cpp` to use our C++ `Agent` class and call C++ code directly from Python.
- After building, run `./lander` from the `build/` folder to execute `main.cpp`
- To use the modules in `lander_agent_cpp.so`, import from this file as you would from a regular Python file containing classes (see the sketch below)
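As a rough usage sketch (the constructor arguments and method names of the bindings may differ from what is shown here; see `test_lander_agent_cpp.py` for the real API):

```python
# Sketch: using the compiled Pybind11 module from Python after building.
# The Agent constructor signature is an assumption.
import build.lander_agent_cpp as lander_agent_cpp

agent = lander_agent_cpp.Agent()  # class exposed by agent_wrapper.cpp
```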
To set up OpenGL to display correctly on WSL, follow these instructions to set up a VcXsrv server:
- Use VcXsrv to open a new server
- Set `export DISPLAY=[Your IP Address]:0`, where `0` is your screen number (you should have set this to zero previously)
- Set `LIBGL_ALWAYS_INDIRECT=1`
Previously, the repo used a flag to selectively declare variables. All variables are now declared as `extern` in `lander.h` and defined at the start of the file where they are first used.
- Ideally we would use singleton or otherwise encapsulated classes here, but there are simply too many variables interacting with each other
- Added a `render` variable that allows running simulations without GLUT
  - Currently, only `render=true, agent_flag=false` and `render=false, agent_flag=true` are supported
- Making the model (actor, critic) smaller
- Changing the reward function to be the mechanical energy
- PPO params:
  - Increasing entropy
  - Batch size
  - Number of epochs
  - Learning rate
- Discrete action space instead of continuous?
- Often, training for too long can decrease performance (the mean reward decreases)
- First make the landing condition easier (a more lenient threshold for what counts as a safe landing), then slowly tighten the threshold
- It's important to normalize the observation space, especially when observations have a large range (see the sketch after this list)
  - The `gym` package's `NormalizeObservation` wrapper environment is very convenient for this! It keeps a running mean of the observations, but note that it does this by wrapping around the `step` function
  - This is why, for plotting, you should only plot quantities that `step` puts into `info`, such as the un-normalized observations
  - The values coming out of `step` should only be what the model sees
- It's really crucial to normalize rewards using the `NormalizeReward` wrapper environment, which rescales rewards so that an exponential moving average of the returns keeps a fixed variance
  - This is extremely helpful, as I don't need to manually tune constants to rescale my reward
  - If you don't normalize rewards and they are too big, the reward somehow decreases with training instead?
- I also "normalized" the action space by applying a linear transformation from the original action space of $throttle \in [0,1]$ to $[-1,1]$ so that the model can learn better
- So currently, our base `LanderEnv` accepts transformed actions and outputs untransformed observations