Skip to content

Latest commit

 

History

History
31 lines (18 loc) · 2.29 KB

File metadata and controls

31 lines (18 loc) · 2.29 KB

Reinforcement learning

Machine Learning

Area of AI based on algorithms capable of learning, extracting knowledge. Extract knowledge, they cannot create it. The idea is to build software that can make decisions on new unseen data. We use ML mainly in problems where its too difficult to define rules for the agents.

data $\rightarrow$ experience

Supervised learning

Desired outputs. The agent tries to reach them.

Unsupervised learning

Just tries to find patterns/regularities in the data.

How Reinforcement Learning works

The agent receives rewards based on how it performs. The only goal of the agent is to maximise the long term rewards. The agent needs to find a balance between 'exploitation and exploration' -> there is the 'exploitation and exploration dilemma'. There are many policies .. a greedy policy where the agent tries to perform the best rewarding action for each state. A $\epsilon$-greedy policy where the agent with probability $\epsilon$ perform a random action (exploration phase, the agent maybe discovers a better path with this action).

Also, the enviroment must satisfy the Markov Property:

History leads me here, but the next state and reward depends only on the current state/action . It's also important to design the right Reward Function.

Q-learning is a famous algorithm used by this class of learning agents. It's based on a Q table where the Q value of each state rapresent the 'reward' of the state. The Q-table so it composed by all the possible states (rows) and for each state are considered all the possible actions (columns), then we fill each cell with the immediate reward of that action in that states. Later we perform a continue approximation of the reward of the states considering also the long term reward, using this formula: $$Q(s, a) \leftarrow (1 - \alpha)Q(s, a) + \alpha( r(s, a) + \gamma(max_{a \in A} \space Q(s',a')))$$ (in this course we use a slighty simplified version of the Q-learning). Where the the first term is the 'exploitation phase' and the second term is the 'exploration phase' where the agent considered also the long term award. $\alpha$ is the factor that is balancing exploitation and exploration and $\gamma$ is the discount factor: more $\gamma$ is greater and more you are giving importance to the long term reward.