Actor-Critic Methods in Reinforcement Learning - An Introduction

Nish · February 3, 2025

Introduction

Actor-Critic methods are a type of reinforcement learning algorithm that combine the benefits of both value-based and policy-based approaches. This blog post aims to provide a high-level overview of these methods.

Background

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. RL methods can be broadly categorized into value-based and policy-based methods. In previous posts, we have discussed policy gradient algorithms along with a brief introduction to RL; see here.

Actor-Critic Approach

The core idea of Actor-Critic methods is to combine an “actor” (policy) and a “critic” (value function): the actor selects actions given a state, and the critic “critiques” the actions chosen by the policy.

Pictorial representation of high-level approaches to RL.

More concretely, there are three main approaches to reinforcement learning: value-based, policy-based, and actor-critic methods. In value-based learning, we learn a value function $Q_\theta(s, a)$ and infer a policy through maximization: $\pi(s) = \arg\max_a Q_\theta(s, a)$. The policy here is implicit. Policy-based learning takes a different route by explicitly learning the policy $\pi_\theta(a \vert s)$ that maximizes expected reward, without maintaining a value function. Actor-critic methods combine both approaches by simultaneously learning a value function and a policy, leveraging the benefits of each.
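As a rough sketch of this distinction (the toy networks below are hypothetical and purely illustrative, just to show where the policy lives in each case):

import torch
import torch.nn as nn

# Hypothetical toy networks, for illustration only.
q_net = nn.Linear(4, 2)                                          # value-based: state -> Q(s, a) per action
policy_net = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=-1))  # policy-based: state -> pi(a | s)

state = torch.randn(4)

# Value-based: the policy is implicit, recovered by maximising over Q-values.
action_value_based = q_net(state).argmax()

# Policy-based: the policy is explicit, we sample directly from pi_theta(a | s).
action_policy_based = torch.distributions.Categorical(policy_net(state)).sample()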

Role of the Actor

  • Learning the optimal policy.
  • Taking actions in the environment.

Role of the Critic

  • Evaluating the actions taken by the actor.
  • Providing feedback to the actor.

Advantages of Actor-Critic Methods

Combining policy and value-based approaches allows Actor-Critic methods to address some limitations of each individual approach, providing a more robust learning framework.

The motivation for using Actor-Critic methods stems from the challenges inherent in both value-based and policy-based RL. As Vapnik famously stated, “When solving a problem of interest, do not solve a more general problem as an intermediate step.” In the context of value-based RL, this suggests that learning a value function can be unnecessarily complex, especially when a simpler policy would suffice. On the other hand, pure policy gradient methods, such as REINFORCE, often suffer from high variance.

This variance issue arises because policy gradients estimate the expected reward by playing episodes under the current policy and recording the states, actions, and rewards encountered. While this Monte-Carlo sampling approach provides an unbiased estimate, it can have high variance, meaning individual policy updates may move in suboptimal directions. This compounds over training, since the next batch of episodes is collected with the already-degraded policy, making it harder to recover.

To illustrate, imagine playing $N = 1000$ games, where each game is formalised as an RL episode, and plotting a histogram of the observed returns; the spread will typically be wide, indicating high variance. If you then drew a small sample of returns, say $N = 5$, you would likely see large differences between them.

Actor-Critic methods aim to reduce this variance by incorporating a critic that provides more stable and efficient feedback to the actor. In terms of the game analogy above, the distribution of estimated returns becomes more peaked, and thus lower variance, at the cost of a slight bias.
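To see where this variance reduction comes from, it helps to compare the two gradient estimators (written here in their standard textbook forms, not taken from the post's code). REINFORCE weights the score function by the full Monte-Carlo return $G_t$, while an actor-critic replaces it with the critic's one-step TD error $\delta_t$, which is much less noisy but introduces a small bias through the learned value function $V_w$:

$$ \nabla_\theta J(\theta) \;\approx\; \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \qquad \text{(REINFORCE)} $$

$$ \delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t), \qquad \nabla_\theta J(\theta) \;\approx\; \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \delta_t \qquad \text{(actor-critic)} $$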

Symbolises the difference between the reward estimation process behind policy-based methods like REINFORCE and actor-critic based approaches.

Don’t get too hung up on the exact shapes of the plots themselves; they’re just there to illustrate the idea.

Note: You can think of this as stemming from the fact that the sample mean used to estimate the expected reward has its own expectation $\mathbb{E}[\bar{X}]$ and variance $\text{Var}(\bar{X})$. While increasing the number of games $N$ reduces the variance of the estimate (it scales as $\text{Var}(X)/N$), it also increases the computational cost. Actor-Critic methods offer a way to reduce variance without such a steep increase in computation.
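A quick numerical sketch of this trade-off (the reward distribution below is made up purely for illustration): the spread of the sample-mean estimate shrinks roughly as $1/\sqrt{N}$, but only at the cost of playing more games.

import numpy as np

rng = np.random.default_rng(0)

def estimate_expected_return(n_games):
    # Pretend each game's return is a draw from some noisy distribution.
    returns = rng.normal(loc=10.0, scale=5.0, size=n_games)
    return returns.mean()

for n in (5, 100, 1000):
    # Repeat the estimation many times to see how much the estimate itself varies.
    estimates = [estimate_expected_return(n) for _ in range(500)]
    print(f"N={n:5d}  std of the estimated expected return: {np.std(estimates):.3f}")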

Key Components

Example variations in implementation of actor critic methods.

Actor Network

  • Architecture: Typically a neural network.
  • Function: Parameterizes the policy.

Critic Network

  • Architecture: Typically a neural network.
  • Function: Estimates the value function (e.g., Q-value or state-value).

Sample implementation, full code here:

import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    """
    implements both actor and critic in one model
    """
    def __init__(self):
        super(Policy, self).__init__()
        self.affine1 = nn.Linear(4, 128)

        # actor's layer
        self.action_head = nn.Linear(128, 2)

        # critic's layer
        self.value_head = nn.Linear(128, 1)

        # action & reward buffer
        self.saved_actions = []
        self.rewards = []

    def forward(self, x):
        """
        forward of both actor and critic
        """
        x = F.relu(self.affine1(x))

        # actor: chooses action to take from state s_t
        # by returning probability of each action
        action_prob = F.softmax(self.action_head(x), dim=-1)

        # critic: evaluates being in the state s_t
        state_values = self.value_head(x)

        # return values for both actor and critic as a tuple of 2 values:
        # 1. a list with the probability of each action over the action space
        # 2. the value from state s_t
        return action_prob, state_values
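As a sketch of how this network might be used for action selection (the select_action helper below is illustrative and not part of the snippet above), we sample from the actor's distribution and save the log-probability together with the critic's value for the later update:

import torch
from torch.distributions import Categorical

model = Policy()

def select_action(state):
    # Illustrative helper: sample from the actor, keep what the update step needs.
    state = torch.as_tensor(state, dtype=torch.float32)
    probs, state_value = model(state)
    dist = Categorical(probs)
    action = dist.sample()
    model.saved_actions.append((dist.log_prob(action), state_value))
    return action.item()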

Algorithm Flow

  1. The actor takes an action based on the current policy.
  2. The critic evaluates the action and provides feedback (e.g., TD error).
  3. The actor updates its policy based on the feedback from the critic.
  4. The critic updates its value function to better estimate future rewards.
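A hedged sketch of one iteration of this loop is given below, assuming a Gym-style env, the Policy model above, and illustrative hyperparameters; none of these names come from the original post.

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

optimizer = torch.optim.Adam(model.parameters(), lr=3e-2)
gamma = 0.99

def actor_critic_step(state):
    state_t = torch.as_tensor(state, dtype=torch.float32)

    # 1. The actor takes an action under the current policy.
    probs, value = model(state_t)
    value = value.squeeze(-1)
    dist = Categorical(probs)
    action = dist.sample()
    next_state, reward, done, _ = env.step(action.item())  # assumed Gym-style API

    # 2. The critic evaluates the action via the TD error.
    with torch.no_grad():
        _, next_value = model(torch.as_tensor(next_state, dtype=torch.float32))
        target = reward + gamma * next_value.squeeze(-1) * (1 - int(done))
    td_error = target - value

    # 3. The actor updates its policy using the critic's feedback ...
    actor_loss = -dist.log_prob(action) * td_error.detach()
    # 4. ... and the critic updates its value estimate towards the TD target.
    critic_loss = F.mse_loss(value, target)

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
    return next_state, done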

Variants of Actor-Critic Methods

  • A2C (Advantage Actor-Critic)
  • A3C (Asynchronous Advantage Actor-Critic)
  • DDPG (Deep Deterministic Policy Gradient)
  • TD3 (Twin Delayed DDPG)
  • SAC (Soft Actor-Critic)

Applications

Actor-Critic methods are used in various real-world applications, including robotics, game playing, autonomous driving, and providing personalised recommendations to customers.

Example use case for RL in personalised customer recommendations.

Conclusion

Actor-Critic methods combine the strengths of value-based and policy-based approaches, making them a powerful tool in reinforcement learning.

Further Resources

Citation Information

If you find this content useful & plan on using it, please consider citing it using the following format:

@misc{nish-blog,
  title = {Actor-Critic Methods in Reinforcement Learning - An Introduction},
  author = {Nish},
  howpublished = {\url{https://www.nishbhana.com/Actor-Critic-Intro/}},
  note = {[Online; accessed]},
  year = {2025}
}
