Policy gradients (REINFORCE)#
In this lesson, we discuss our first algorithm, policy gradients, also known as REINFORCE. It is a policy-based algorithm, meaning it optimizes the policy directly.
To train our agent, we need a policy whose parameters are updated from the agent's experience in the environment in a way that maximizes an objective function.
The objective function#
First, we need to define the return of a trajectory. A trajectory is simply a sequence of states \(s\), actions \(a\), and rewards \(r\) encountered by an agent in the environment as it interacts over time. Formally, a trajectory \(\tau\) is represented as:
\[ \tau = (s_0, a_0, r_1, s_1, a_1, r_2, \dots, s_T, a_T) \]
Where:
\(s_t\) is the state at time step \(t\),
\(a_t\) is the action taken at time step \(t\),
\(r_{t+1}\) is the reward received after taking action \(a_t\) in state \(s_t\).
The return \(G_t\) of a trajectory is the total accumulated reward starting from time step \(t\) and can be defined as the sum of all rewards obtained from \(t\) to the end of the episode (or trajectory). If the trajectory ends after \(T\) time steps, the return is:
\[ G_t = r_t + r_{t+1} + \dots + r_T = \sum_{k=t}^{T} r_k \]
In many RL settings, like this one, a discount factor \(\gamma\) (where \(0 \leq \gamma \leq 1\)) is applied to future rewards to account for the fact that rewards obtained earlier in time are usually more valuable than those obtained later. In that case, the return is given by the discounted sum of future rewards:
\[ G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k \]
Or equivalently:
\[ G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t} r_T \]
This formulation allows the agent to weigh immediate rewards more heavily than distant future rewards, which can be useful in environments with long time horizons.
In summary, the return of a trajectory is the total discounted reward the agent accumulates from a given time step until the end of the episode.
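To make this concrete, here is a small Python sketch (with made-up reward values) that computes a discounted return both directly from the sum and by the backward recursion \(G_t = r_t + \gamma G_{t+1}\), which is also how the training code later in this lesson computes it.

rewards = [1.0, 0.0, 2.0, 3.0]   # example rewards r_t, r_{t+1}, r_{t+2}, r_{t+3}
gamma = 0.9                      # discount factor

# Direct application of G_t = sum_k gamma^(k-t) * r_k
g0_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Backward recursion: G_t = r_t + gamma * G_{t+1}
g = 0.0
returns = []
for r in reversed(rewards):
    g = r + gamma * g
    returns.insert(0, g)

print(g0_direct, returns[0])  # both are approximately 4.807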
Having previously explained the return of a trajectory as the discounted sum of future rewards, we can now define the objective function for policy gradients. The goal is to maximize the expected return over all possible trajectories generated by our policy. This can be expressed as:
\[ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ G(\tau) \right] = \sum_{\tau} P(\tau \mid \theta)\, G(\tau) \]
Where:
\(J(\theta)\) is the objective function, representing the expected total reward,
\(\tau\) is a trajectory, a sequence of states, actions, and rewards,
\(P(\tau|\theta)\) is the probability of trajectory \(\tau\) occurring under the policy parameterized by \(\theta\),
\(G(\tau)\) is the return (total reward) accumulated along trajectory \(\tau\).
This objective function reflects the goal of policy gradients: to optimize the policy parameters \(\theta\) in order to maximize the expected return. By doing so, the agent learns to increase the probability of actions that lead to higher rewards.
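Since the expectation over all trajectories cannot be computed exactly, it is approximated by averaging the returns of sampled episodes. Below is a minimal sketch of that idea, where `run_episode_and_get_return` is a hypothetical helper that plays one episode with the current policy and returns its discounted return \(G(\tau)\):

def estimate_objective(run_episode_and_get_return, num_episodes=100):
    # Monte Carlo estimate: J(theta) is approximated by the average return over sampled episodes
    returns = [run_episode_and_get_return() for _ in range(num_episodes)]
    return sum(returns) / len(returns)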
The update rule is derived from the likelihood ratio of actions taken in relation to the rewards they produced. This is done using the log probability of the actions taken during each trajectory:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \]
This means that we adjust the policy parameters based on how much each action contributes to the return. The agent increases the probability of actions that lead to higher rewards, helping it improve its decisions with every trajectory it experiences.
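For reference, here is a sketch of the standard derivation. It uses the log-derivative trick \(\nabla_\theta P = P \, \nabla_\theta \log P\), which is what makes the log probabilities appear:

\[ \nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau} P(\tau \mid \theta)\, G(\tau) = \sum_{\tau} P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, G(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ G(\tau) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \]

The last step uses the fact that the environment's transition probabilities do not depend on \(\theta\), so only the policy terms of \(\log P(\tau \mid \theta)\) survive differentiation. Replacing the total return \(G(\tau)\) with the reward-to-go \(G_t\) for each action (rewards earned before an action cannot be influenced by it) gives the per-time-step form shown above.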
To maximize the objective function, we use gradient ascent, which updates the policy parameters \(\theta\) in the direction of the gradient of the objective function. This method increases the objective function by moving in the direction of the steepest ascent.
Note: Gradient ascent is the opposite of gradient descent, an optimization algorithm that adjusts parameters in the direction of the negative gradient of a loss function to minimize it.
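As a toy illustration (unrelated to RL), the sketch below maximizes the simple function \(f(\theta) = -(\theta - 3)^2\) with PyTorch. Since PyTorch optimizers minimize, gradient ascent is implemented by descending on the negated objective, which is exactly what the REINFORCE code later in this lesson does with its negated log-probability loss.

import torch

theta = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=0.1)  # optimizers minimize by default

for _ in range(100):
    objective = -(theta - 3.0) ** 2   # f(theta), maximized at theta = 3
    loss = -objective                 # minimizing -f(theta) is the same as maximizing f(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(theta.item())  # approaches 3.0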

Monte Carlo sampling#
In the REINFORCE algorithm, Monte Carlo sampling is used to estimate the return of a trajectory by sampling entire episodes (or trajectories) from the environment.
The basic process of Monte Carlo sampling in REINFORCE works as follows:
Sample a trajectory: The agent interacts with the environment by following its current policy, generating a trajectory \(\tau = (s_0, a_0, r_1, s_1, \dots, s_T, a_T)\) until the episode ends.
Compute the return: For each time step \(t\) in the trajectory, compute the total reward (return) from that point onward:
\[ G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k \]
where \(G_t\) is the return at time step \(t\), \(\gamma\) is the discount factor, and \(r_k\) is the reward at time step \(k\).
Update policy parameters: Use the return \(G_t\) as an estimate of the expected reward to update the policy parameters \(\theta\) using the gradient of the log-probability of the taken actions:
\[ \theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \]
Here, \(\pi_\theta(a_t | s_t)\) is the probability of taking action \(a_t\) in state \(s_t\) under the current policy, and \(\alpha\) is the learning rate.
By repeatedly sampling trajectories and updating the policy based on the returns from those samples, the agent improves its policy over time. In summary, Monte Carlo sampling allows REINFORCE to estimate the return from actual sampled trajectories, without needing a model of the environment, and to update the policy based on those samples.
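The sketch below strings the three steps together for a single update. It uses a hypothetical two-action policy over a 4-dimensional state and random placeholder rewards instead of a real environment; it is only meant to show the shape of the computation, not the car example used later in this lesson.

import torch
from torch.distributions import Categorical

policy = torch.nn.Sequential(          # tiny policy network: state -> action probabilities
    torch.nn.Linear(4, 2),
    torch.nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

# 1. Sample a trajectory (random states and rewards stand in for an environment here)
log_probs, rewards = [], []
for _ in range(10):
    state = torch.randn(1, 4)                # hypothetical 4-dimensional state
    dist = Categorical(policy(state))
    action = dist.sample()
    log_probs.append(dist.log_prob(action))  # store log pi(a_t | s_t)
    rewards.append(float(torch.randn(1)))    # placeholder reward from the "environment"

# 2. Compute the return G_t for every time step (backward recursion)
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.insert(0, g)
returns = torch.tensor(returns)

# 3. Update the policy parameters: gradient ascent on sum_t G_t * log pi(a_t | s_t)
loss = -(torch.cat(log_probs) * returns).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()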
The Policy Gradient Process#
The Policy Gradient (REINFORCE) algorithm updates the agent’s policy directly based on the returns from sampled trajectories. The following steps outline how the policy gradient algorithm works in practice:
Initialize the policy parameters: Start by initializing the policy parameters \(\theta\) randomly. These parameters define the agent’s policy \(\pi_\theta(a | s)\), which gives the probability of taking action \(a\) in state \(s\).
For each episode:
Observe the current state \(s_0\).
Sample actions from the policy: The agent selects an action \(a_t\) in each state \(s_t\) according to its current policy \(\pi_\theta(a_t | s_t)\). This involves sampling actions based on the probability distribution defined by the policy.
Execute the action and observe the reward \(r_{t+1}\) and the next state \(s_{t+1}\).
Store the rewards and log probabilities of the actions taken throughout the episode.
Compute the returns: Once the episode is completed, compute the return \(G_t\) for each time step \(t\), which is the total discounted reward starting from that step:
\[ G_t = \sum_{k=t}^{T} \gamma^{k-t} r_{k} \]
Update the policy parameters: After collecting the returns, update the policy parameters \(\theta\) using gradient ascent:
\[ \theta \leftarrow \theta + \alpha G_t \nabla_\theta \log \pi_\theta(a_t | s_t) \]
This update moves the policy parameters in the direction that maximizes the expected return.
Repeat for many episodes: Over time, as the policy is updated based on the returns of sampled trajectories, the agent’s performance should improve, and the policy will converge to one that maximizes the total reward.
Coding Policy Gradients#
import torch
import numpy as np
from torch.distributions import Categorical


class PGCar:
    def __init__(self):
        self.model = self.create_model()  # 1. Initialize the policy parameters (the network is defined in the full project)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.log_probs = []  # log probabilities of the actions taken in the current episode
        # Assumed hyperparameters; the values used in the full project may differ.
        self.discount_factor = 0.99
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=1e-3)
        # self.angle and self.speed are set by the car/game code in the full project.

    def forward(self, state):
        state = np.array(state, dtype=np.float32)
        state = torch.from_numpy(state).float().unsqueeze(0).to(self.device)
        probs = self.model(state)  # action probabilities from the policy network
        m = Categorical(probs)
        action = m.sample()  # Sample an action from the policy
        self.log_probs.append(m.log_prob(action))  # Store the log probability of the action taken
        return action.item()

    def action_train(self, state):
        action = self.forward(state)
        # Execute the action
        if action == 0:
            self.angle += 10  # Left
        elif action == 1:
            self.angle -= 10  # Right
        elif action == 2:
            if self.speed - 2 >= 6:
                self.speed -= 2  # Slow down
        else:
            self.speed += 2  # Speed up

    def onpolicy_reset(self):
        self.log_probs = []  # clear the stored log probabilities before the next episode

    def train(self, rewards):
        self.optimizer.zero_grad()
        returns = []
        future_return = 0
        for r in reversed(rewards):  # 3. Compute the discounted return
            future_return = r + self.discount_factor * future_return
            returns.insert(0, future_return)
        returns = torch.tensor(returns).to(self.device)
        policy_loss = []
        for log_prob, R in zip(self.log_probs, returns):
            policy_loss.append(-log_prob * R)  # negative sign: the optimizer minimizes
        policy_loss = torch.stack(policy_loss).sum()
        policy_loss.backward()  # Update the policy parameters
        self.optimizer.step()
        self.onpolicy_reset()
        return policy_loss.item()


class PGRace:
    def training_race(self, car: PGCar, episodes=50):
        for episode in range(1, episodes + 1):  # 2. For each episode
            current_state = car.get_data()  # Observe the state (sensor data from the game)
            rewards = []
            done = False
            episode_reward = 0
            while not done:
                car.action_train(current_state)  # Sample an action from the policy and execute it
                new_state, reward, done = self.step(car)  # Observe the reward and the new state
                rewards.append(reward)  # Store the reward
                episode_reward += reward
                current_state = new_state
            loss = car.train(rewards)  # Update the policy from the episode's rewards and log probabilities
Actual training#
Highlights#
[02:22] The car makes it through the first curve.
[03:20] The car makes the first long run.
[05:06] The car completes a full lap. 🎉