A2C (Advantage Actor-Critic)#
In this lesson, we will explore the Advantage Actor-Critic (A2C) algorithm, a popular method that combines the strengths of policy-based and value-based reinforcement learning techniques. While there are both synchronous and asynchronous versions of A2C, we will focus here on the core concepts and implement A2C with a single agent interacting with the environment. This lets us understand the core idea without getting into parallelism or the technicalities of synchronization.
A2C improves upon vanilla Policy Gradient methods by incorporating the value function to reduce variance during training and accelerate learning. Let’s explore the details of how A2C works.
Actor-Critic Overview#
In Actor-Critic (AC) methods, the agent consists of two primary components:
Actor: This is the policy function \(π(a∣s)\), which maps states to actions. The actor is responsible for deciding which action to take in a given state.
Critic: This is the value function \(V(s)\) or the Q-value function \(Q(s,a)\), which evaluates the quality of the actions chosen by the actor. It helps the actor improve by providing feedback on how good the chosen actions were in terms of expected rewards.
The Advantage Function#
The key idea behind A2C is the advantage function, which tells the agent how much better (or worse) a particular action is compared to the average action in a given state.
Mathematically, the advantage function is defined as:
\[A(s,a) = Q(s,a) - V(s)\]
Where:
\(Q(s,a)\) is the expected return (cumulative reward) after taking action \(a\) in state \(s\),
\(V(s)\) is the expected return from state \(s\) under the current policy.
The advantage quantifies how much better or worse the selected action \(a\) is compared to the average action the agent could take in state \(s\).
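To make this concrete, here is a tiny sketch with made-up numbers (the variable names are illustrative, not part of the lesson's code):

Q_sa = 6.0   # estimated return after taking action a in state s
V_s = 5.5    # estimated return from state s under the current policy
advantage = Q_sa - V_s
print(advantage)  # 0.5 -> this action is slightly better than the average action in s

A positive advantage nudges the policy toward the action; a negative one nudges it away.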
Why Use Advantage?#
Using the advantage function helps reduce the variance in policy gradient updates. Instead of just updating based on rewards (as in vanilla Policy Gradient methods), the advantage provides a more stable target for updating the actor (policy), leading to more reliable learning.
A2C Algorithm#
The A2C training update involves the following loss terms:
Actor Loss (Policy Gradient)
The goal of the actor is to maximize the expected return by adjusting the policy to favor actions that lead to higher rewards. The policy gradient is calculated using the advantage function \(A(s,a)\), which leads to the actor loss:
\[L_{\text{actor}} = -\log \pi(a \mid s)\, A(s,a)\]
Where:
\(\log \pi(a \mid s)\) is the log probability of taking action \(a\) in state \(s\) under the current policy.
\(A(s,a)\) is the advantage, indicating how much better or worse this action was compared to the baseline.
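As a minimal PyTorch sketch of this loss, assuming log-probabilities and advantages have already been collected for a few sampled actions (the numbers below are placeholders):

import torch

log_probs = torch.tensor([-0.9, -1.2, -0.3])   # log pi(a|s) for three sampled actions
advantages = torch.tensor([0.5, -0.2, 1.0])    # A(s, a) for the same transitions
actor_loss = -(log_probs * advantages).mean()  # minimizing this maximizes expected return

In real training the log-probabilities carry gradients from the actor network, while the advantages are treated as constants.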
Critic Loss (Value Function)
The critic helps by estimating the value of states. The critic’s goal is to minimize the mean squared error (MSE) between the predicted value and the actual return (bootstrapped from future rewards):
\[L_{\text{critic}} = \left(r + \gamma V(s') - V(s)\right)^2\]
Where:
\(r\) is the reward received from the environment,
\(γ\) is the discount factor,
\(V(s)\) is the value estimate for the current state,
\(V(s′)\) is the value estimate for the next state.
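A minimal sketch of this target and loss with placeholder numbers (nothing here is tied to the lesson's environment):

import torch

r = 1.0                          # reward from the environment
gamma = 0.99                     # discount factor
V_s = torch.tensor(2.5)          # critic's estimate for the current state
V_s_next = torch.tensor(2.0)     # critic's estimate for the next state

td_target = r + gamma * V_s_next          # bootstrapped return: 2.98
critic_loss = (td_target - V_s).pow(2)    # squared TD error: 0.2304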
Total Loss
The total loss is a combination of the actor loss and the critic loss, with an optional entropy bonus to encourage exploration:
\[L = L_{\text{actor}} + c_1 L_{\text{critic}} - c_2 H(\pi)\]
Here:
\(c_1\) and \(c_2\) are hyperparameters that control the contribution of the critic loss and the entropy bonus.
\(H(\pi)\) is the entropy of the policy’s action distribution.
The entropy bonus encourages exploration by penalizing overly confident (low-entropy) action distributions, which is especially helpful early in training.
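Putting the pieces together, a minimal sketch of the combined objective (the weights and placeholder loss values below are illustrative, not tuned):

import torch

actor_loss = torch.tensor(0.80)
critic_loss = torch.tensor(0.23)
entropy = torch.tensor(1.10)      # higher entropy = more exploratory policy

c1, c2 = 0.5, 0.01                # critic and entropy weights
total_loss = actor_loss + c1 * critic_loss - c2 * entropy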
Key Components of A2C#
Advantage Function#
Unlike Q-learning and SARSA, which directly estimate the action-value function \(Q(s,a)\), A2C uses the Advantage Function to capture how much better an action is compared to the baseline (i.e., the value of the current state).
The advantage is calculated as:
\[A(s,a) = r + \gamma V(s') - V(s)\]
This difference between the bootstrapped return \(r + \gamma V(s')\) and the critic’s estimate \(V(s)\) helps stabilize the updates and reduces variance in the gradient estimates.
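For instance, with illustrative numbers \(r = 0.5\), \(\gamma = 0.9\), \(V(s') = 1.0\), and \(V(s) = 1.6\), the advantage is \(A(s,a) = 0.5 + 0.9 \cdot 1.0 - 1.6 = -0.2\): the action turned out slightly worse than the critic expected, so its probability will be nudged down.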
The A2C Process#
Initialize the Actor and Critic Networks
Start by initializing two neural networks:
The actor network to approximate the policy \(\pi_\theta(a \mid s)\),
The critic network to estimate the value function \(V(s)\).
Interact with the Environment
At each time step, the agent:
Observes the current state \(s\),
Chooses an action \(a\) according to the policy from the actor,
Executes the action, observes the reward \(r\) and the next state \(s′\).
Compute the Advantage
Once the reward \(r\) and the next state \(s′\) are known, the advantage is computed as:
\[A(s,a) = r + \gamma V(s') - V(s)\]
Update the Networks
Actor Update: Adjust the policy (actor) using the policy gradient and the advantage function.
Critic Update: Minimize the TD error to improve the value estimates.
Repeat the process for multiple episodes, allowing the agent to refine its policy and value estimates over time.
Coding A2C#
import torch
from torch import nn


class ActorCritic(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(ActorCritic, self).__init__()
        # Shared feature layer used by both heads
        self.shared = nn.Sequential(nn.Linear(input_size, hidden_size), nn.ReLU())
        # Actor head: a probability distribution over the available actions
        self.actor = nn.Sequential(
            nn.Linear(hidden_size, output_size), nn.Softmax(dim=-1)
        )
        # Critic head: a single scalar estimate of V(s)
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, x):
        shared = self.shared(x)
        return self.actor(shared), self.critic(shared)
class A2Car:
    def __init__(self, input_size, hidden_size, output_size):
        # The device, optimizer, and hyperparameter values below are illustrative
        # assumptions; the car's driving attributes (angle, speed, sensors) are
        # assumed to be set up by the game code.
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = ActorCritic(input_size, hidden_size, output_size).to(self.device)
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=1e-3)
        self.discount_factor = 0.99
        self.critic_weight = 0.5
        self.entropy_weight = 0.01
        self.reset_episode()

    def reset_episode(self):
        # Clear the per-episode buffers consumed by train()
        self.log_probs = []
        self.values = []
        self.entropies = []
        self.rewards = []

    def select_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        probs, value = self.model(state)
        # Sample an action from the actor's categorical distribution
        m = torch.distributions.Categorical(probs)
        action = m.sample()
        log_prob = m.log_prob(action)
        entropy = m.entropy()
        # Store everything needed for the end-of-episode update
        self.log_probs.append(log_prob)
        self.values.append(value)
        self.entropies.append(entropy)
        return action.item()

    def action_train(self, state):
        # Translate the sampled action index into car controls
        action = self.select_action(state)
        if action == 0:
            self.angle += 10  # Left
        elif action == 1:
            self.angle -= 10  # Right
        elif action == 2:
            if self.speed - 2 >= 6:
                self.speed -= 2  # Slow Down
        else:
            self.speed += 2  # Speed Up

    def train(self):
        # Compute discounted returns by iterating over the rewards backwards
        returns = []
        R = 0
        for reward in self.rewards[::-1]:
            R = reward + self.discount_factor * R
            returns.insert(0, R)
        returns = torch.tensor(returns).to(self.device)

        log_probs = torch.stack(self.log_probs)
        values = torch.stack(self.values).squeeze()
        entropies = torch.stack(self.entropies)

        # Advantage: how much better the observed return was than the critic expected.
        # It is detached so the actor loss does not backpropagate into the critic.
        advantages = returns - values.detach()

        actor_loss = -(log_probs * advantages).mean()
        # The critic regresses V(s) toward the observed return; values must NOT be
        # detached here, otherwise the critic would receive no gradient.
        critic_loss = (returns - values).pow(2).mean()
        entropy_loss = -entropies.mean()

        loss = actor_loss + self.critic_weight * critic_loss + self.entropy_weight * entropy_loss

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        self.reset_episode()
        return loss.item()
class A2CRace:
    def training_race(self, car: A2Car, episodes, train_every):
        for episode in range(1, episodes + 1):
            # Start a fresh episode: clear the car's buffers and read its initial state
            # (car.get_data() is assumed to return the car's sensor readings)
            car.reset_episode()
            current_state = car.get_data()
            done = False
            episode_reward = 0

            while not done:
                car.action_train(current_state)
                # self.step(car) is assumed to advance the race one frame and
                # return the next state, the reward, and the done flag
                new_state, reward, done = self.step(car)
                car.rewards.append(reward)
                episode_reward += reward
                current_state = new_state

            # Update the actor and critic once the episode is over
            loss = car.train()
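To tie things together, the ActorCritic network defined above can be sanity-checked on its own with a quick forward pass (the layer sizes below are arbitrary choices for illustration, not values fixed by the lesson):

import torch

model = ActorCritic(input_size=5, hidden_size=64, output_size=4)
dummy_state = torch.randn(1, 5)          # a fake observation with 5 features
probs, value = model(dummy_state)

print(probs)   # action probabilities, shape (1, 4), summing to 1
print(value)   # state-value estimate V(s), shape (1, 1)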