The main repo for the WiDS 2024 project - Gambling with RL
We'll be focusing on the basics of RL and essential Python programming skills, laying the groundwork for further exploration.
Our main aim is to get a general introduction to RL and explore the fundamentals of Markov Decision Processes (MDPs), focusing on their structure, components, and role in modeling decision-making in reinforcement learning.
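To make those components concrete, here is a small, purely hypothetical two-state MDP written out in code. The `P[state][action]` layout mirrors the convention used by Gym's toy-text environments, but the numbers themselves are made up for illustration.

```python
# A hypothetical two-state MDP, just to make the components concrete:
# states, actions, transition probabilities, rewards, and a discount factor.
num_states = 2     # S = {0, 1}
num_actions = 2    # A = {0: "stay", 1: "move"}
gamma = 0.99       # discount factor

# P[s][a] is a list of (probability, next_state, reward, done) tuples,
# the same layout Gym's toy-text environments expose via env.unwrapped.P.
P = {
    0: {
        0: [(1.0, 0, 0.0, False)],                        # staying in state 0 earns nothing
        1: [(0.8, 1, 1.0, False), (0.2, 0, 0.0, False)],  # moving usually reaches state 1
    },
    1: {
        0: [(1.0, 1, 0.0, True)],                         # state 1 is terminal here
        1: [(1.0, 1, 0.0, True)],
    },
}
```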
This week, we'll dive into basic RL algorithms, specifically Policy and Value Iteration. Our goal is to understand how these fundamental techniques help solve MDPs by iteratively improving policies and estimating value functions. Tasks:
- Carefully read Chapter 3 of Grokking Deep Reinforcement Learning to grasp the concepts of Policy and Value Iteration.
- Solve the upcoming assignment to reinforce your understanding.
Watch the provided video resources for a deeper understanding of expected returns, policies, value functions, and optimal policies. These will complement the reading and help clarify key concepts.
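To complement the reading, here is a minimal value-iteration sketch (not the assignment solution) run on Frozen Lake, whose transition table comes in the same `P[s][a]` layout as above. It assumes the `gymnasium` package, and the hyperparameter values are illustrative.

```python
# Minimal value iteration on Frozen Lake; hyperparameters are illustrative.
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
P = env.unwrapped.P                 # P[s][a] -> list of (prob, next_state, reward, done)
n_states = env.observation_space.n
n_actions = env.action_space.n
gamma, theta = 0.99, 1e-8           # discount factor and convergence threshold

V = np.zeros(n_states)
while True:
    Q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            for prob, s_next, reward, done in P[s][a]:
                Q[s][a] += prob * (reward + gamma * V[s_next] * (not done))
    if np.max(np.abs(V - Q.max(axis=1))) < theta:   # stop once values stop changing
        break
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)           # greedy policy w.r.t. the converged value function
```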
This week, we dive into Multi-Armed Bandits (MAB), a fundamental problem in Reinforcement Learning (RL) that serves as a simplified model for decision-making under uncertainty. Unlike Markov Decision Processes (MDPs), where actions influence future states, MAB problems focus on immediate rewards, making them a key concept in online learning and exploration-exploitation trade-offs.
Before starting the chapter, read Section 2.1 and the beginning of Section 2.2 (pages 9-10) of this resource. Then, move on to:
- Grokking Chapter 4: Read the chapter carefully and solve the upcoming assignment.
- Sutton and Barto, Chapter 2: Recommended after completing Grokking for deeper insights.
To enhance your understanding, read about the regret function here, which quantifies how much reward is lost due to not choosing the optimal action at every step.
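As a rough companion to that reading: if the true mean reward of each arm were known, cumulative regret is simply the gap between always playing the best arm and the arms actually chosen. The snippet below is a hypothetical illustration (the arm means and choices are made up), producing the kind of curve you'll plot in the implementation week.

```python
# Hypothetical cumulative (pseudo-)regret computation; arm means and choices are made up.
import numpy as np

true_means = np.array([0.1, 0.5, 0.7])    # true mean reward of each arm (unknown to the agent)
chosen_arms = np.array([0, 2, 1, 2, 2])   # arm pulled at each time step

optimal_mean = true_means.max()
per_step_regret = optimal_mean - true_means[chosen_arms]   # loss vs. always playing the best arm
cumulative_regret = np.cumsum(per_step_regret)
print(cumulative_regret)                  # e.g. [0.6, 0.6, 0.8, 0.8, 0.8]
```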
This week, we focus on implementing Multi-Armed Bandits and exploring different strategies to balance exploration vs. exploitation effectively.
- Implement and compare the following MAB algorithms (a minimal sketch of each selection rule follows this list):
- ϵ-Greedy Algorithm
- Upper Confidence Bound (UCB)
- Thompson Sampling
- Analyze their performance on different reward distributions.
- Visualize the regret for each algorithm to understand their efficiency over time.
Use the provided notebook template to implement these algorithms.
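If you want a starting point, here is a minimal sketch of the three selection rules for a Bernoulli bandit. The function names and bookkeeping are illustrative and may not match the template's API; the template's own structure takes precedence.

```python
# Illustrative action-selection rules; bookkeeping (counts, estimates, successes/failures)
# is assumed to be tracked elsewhere in your notebook.
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_estimates, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the current best estimate.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))
    return int(np.argmax(q_estimates))

def ucb(q_estimates, counts, t, c=2.0):
    # Optimism in the face of uncertainty: add a confidence bonus to each estimate.
    bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-9))
    return int(np.argmax(q_estimates + bonus))

def thompson_sampling(successes, failures):
    # Sample a plausible mean for each arm from its Beta posterior and act greedily on the sample.
    samples = rng.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))
```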
This week, we officially start gambling in RL with Blackjack! We'll explore Temporal Difference (TD) learning, a family of methods that, through the TD(λ) formulation, generalizes Monte Carlo learning and allows for more flexible policy evaluation.
Theory:
- Carefully read Chapter 5 of Grokking Deep Reinforcement Learning and solve the upcoming assignment.
Assignment:
- Before training your RL agent to play Blackjack, play the game yourself to get familiar with its rules. Read the official documentation to understand the gameplay.
- We’ll use the Gym library’s “Blackjack-v1” environment to implement TD-learning.
- While Monte Carlo methods must wait for a complete episode rollout before updating, TD-learning updates value estimates online after every step, which generally makes learning more efficient (a rough sketch of the update rule follows this list).
- The λ parameter in TD-learning allows for flexibility:
- Setting λ = 1 recovers Monte Carlo learning.
- Experiment with different λ values to understand their impact on policy learning.
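For orientation only (this is not the notebook's BlackjackTDAgent interface), a rough TD(λ) policy-evaluation sketch on Blackjack-v1 could look like the following. It evaluates a fixed "stick on 20 or 21, otherwise hit" policy with eligibility traces; the hyperparameters are illustrative, and it assumes the gymnasium package.

```python
# Rough TD(lambda) evaluation of a fixed policy on Blackjack-v1; values are illustrative.
from collections import defaultdict
import gymnasium as gym

env = gym.make("Blackjack-v1")
alpha, gamma, lam = 0.05, 1.0, 0.5        # step size, discount, trace-decay lambda

V = defaultdict(float)                     # state -> estimated value under the fixed policy

def policy(obs):
    player_sum, dealer_card, usable_ace = obs
    return 0 if player_sum >= 20 else 1    # 0 = stick, 1 = hit

for episode in range(10_000):
    obs, info = env.reset()
    traces = defaultdict(float)            # eligibility trace per visited state
    done = False
    while not done:
        action = policy(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        td_error = reward + gamma * V[next_obs] * (not done) - V[obs]
        traces[obs] += 1.0                 # accumulating trace for the current state
        for s in traces:                   # every traced state shares in the TD error
            V[s] += alpha * td_error * traces[s]
            traces[s] *= gamma * lam       # decay; lam = 1 keeps full credit -> Monte Carlo
        obs = next_obs
```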
Implementation:
- The helper code is already provided in the notebook—just run the initial cells.
- Implement the BlackjackTDAgent class to train your RL agent.
- Finally, watch your agent play Blackjack and (hopefully) win loads of money—if only it were real! 💰😆
This week marks our transition into real-world-like RL problems. Unlike the earlier phases, which had small and well-defined action and state spaces, we now delve into complex environments where Deep Learning enables us to train Deep Reinforcement Learning (DRL) networks to handle high-dimensional state spaces and continuous action spaces.
Previously, we explored Value Iteration (VI), Policy Iteration (PI), and basic Q-Learning—fundamental concepts in RL. However, these methods struggle to scale to large or infinite state spaces. To tackle this, we introduce function approximation techniques using Neural Networks, which form the backbone of Deep RL.
A neural network is used to approximate the Q-value function, enabling us to handle large state spaces, such as raw pixels from Atari games.
While other techniques like Policy Gradient & Actor-Critic exist, we will focus on DQN and similar methods for now.
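To make the function-approximation idea concrete, here is a minimal Q-network sketch, assuming PyTorch (the course material may use a different framework). A small MLP maps a state vector to one Q-value per action, taking the place of the Q-table from the tabular setting; the sizes used are illustrative (8 inputs and 4 actions happen to match Lunar Lander).

```python
# Minimal Q-network sketch for DQN, assuming PyTorch; sizes are illustrative.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        # Maps a state vector to one Q-value estimate per discrete action,
        # replacing the Q-table used in tabular Q-learning.
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Greedy action selection from the approximated Q-values:
q_net = QNetwork(state_dim=8, n_actions=4)
state = torch.zeros(1, 8)                   # placeholder state vector
action = int(q_net(state).argmax(dim=1))    # argmax over the action dimension
```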
- Start by watching this introductory video to understand Neural Networks (NNs).
- You can initially treat Neural Nets as Black Box Function Approximators and dive deeper later.
- Q-Learning vs Deep Q-Learning + Slippery Frozen Lake using DQN video
- OG DRL Paper Explanation - paper
- Dueling DQN - Full Implementation Playlist - Playlist
By now, you should be comfortable working with DQN. Using the Frozen Lake DQN template, your tasks are:
- Train a DQN agent on the Gymnasium Lunar-Lander environment (an environment sanity-check sketch is included at the end of this section)
- Solve Atari Breakout
For Reference:
- A solved Cart Pole implementation is available in the repo; use it as a guide.
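As a quick sanity check before wiring in DQN (assuming gymnasium with the Box2D extra installed; the environment id may be "LunarLander-v2" on older versions), the sketch below runs one episode with random actions. Your trained agent's greedy action simply replaces the random sample.

```python
# Environment sanity check with random actions; swap in your DQN agent's action later.
import gymnasium as gym

env = gym.make("LunarLander-v3")
obs, info = env.reset(seed=0)

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()      # replace with your agent's greedy action
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode return with random actions: {total_reward:.1f}")
env.close()
```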