ActorCritic

Actor-critic methods are a class of reinforcement learning algorithms that combine two function approximators: an actor that represents the policy and a critic that evaluates the policy by estimating value functions. The actor outputs a policy pi(a|s; theta), mapping states to a distribution over actions. The critic estimates a value function V(s; w) or Q(s,a; w), providing feedback used to improve the policy.
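
As a concrete illustration, here is a minimal sketch of the two function approximators, assuming PyTorch and a discrete action space; the class names Actor and Critic, the hidden-layer width, and the architecture are illustrative choices, not part of any specific algorithm.

    import torch.nn as nn
    from torch.distributions import Categorical

    class Actor(nn.Module):
        """Maps a state to a probability distribution over actions, pi(a|s; theta)."""
        def __init__(self, state_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state):
            return Categorical(logits=self.net(state))

    class Critic(nn.Module):
        """Maps a state to a scalar estimate of V(s; w)."""
        def __init__(self, state_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, 1),
            )

        def forward(self, state):
            return self.net(state).squeeze(-1)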

During learning, the critic is trained to minimize the temporal-difference (TD) error, the gap between its current value estimate and a bootstrapped target formed from the observed reward and the estimated value of the next state. The actor is updated using a policy gradient weighted by the critic's estimate, often via an advantage function A(s,a) = Q(s,a) - V(s) or directly via the TD error, which reduces the variance of the policy-gradient updates.
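
The update rule can be sketched as follows, reusing the hypothetical Actor and Critic modules above together with standard PyTorch optimizers; treating the detached one-step TD error as the advantage estimate is one common choice among several, not a prescribed implementation.

    import torch

    def actor_critic_update(actor, critic, actor_opt, critic_opt,
                            state, action, reward, next_state, done, gamma=0.99):
        # Critic: minimize the squared TD error between V(s) and the
        # bootstrapped target r + gamma * V(s') (zero beyond terminal states).
        value = critic(state)
        with torch.no_grad():
            target = reward + gamma * critic(next_state) * (1.0 - done)
        td_error = target - value

        critic_loss = td_error.pow(2).mean()
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor: policy gradient weighted by the detached TD error, which here
        # doubles as a one-step advantage estimate.
        log_prob = actor(state).log_prob(action)
        actor_loss = -(td_error.detach() * log_prob).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()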

On-policy variants include Advantage Actor-Critic (A2C) and its asynchronous counterpart A3C, where learning uses data collected by the current policy. Off-policy variants include Deep Deterministic Policy Gradient (DDPG) and its successors TD3 and Soft Actor-Critic (SAC), which use replay buffers and target networks to stabilize training.
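
The two stabilization mechanisms mentioned above can be sketched as follows, assuming the networks are PyTorch modules; buffer capacity, batch size, and the interpolation factor tau are illustrative hyperparameters only.

    import random
    from collections import deque

    class ReplayBuffer:
        """Stores past transitions so off-policy methods can reuse them."""
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def push(self, transition):            # (s, a, r, s_next, done)
            self.buffer.append(transition)

        def sample(self, batch_size=256):
            return random.sample(self.buffer, batch_size)

    def soft_update(target_net, online_net, tau=0.005):
        """Move the target network a small step toward the online network."""
        for t_param, param in zip(target_net.parameters(), online_net.parameters()):
            t_param.data.mul_(1.0 - tau)
            t_param.data.add_(tau * param.data)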

Actor-critic methods are prominent in continuous-action domains due to their ability to learn stochastic or deterministic policies. Challenges include training stability, hyperparameter sensitivity, and balancing exploration with exploitation. They are often paired with entropy regularization to encourage exploration.
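
A sketch of how an entropy bonus is typically folded into the actor loss is shown below; dist is assumed to be a torch.distributions object (for example, the output of the Actor sketch above), and the coefficient is a tunable hyperparameter whose default here is only a common starting point.

    def actor_loss_with_entropy(dist, action, advantage, entropy_coef=0.01):
        log_prob = dist.log_prob(action)
        policy_loss = -(advantage.detach() * log_prob).mean()
        entropy_bonus = dist.entropy().mean()   # higher entropy => more exploration
        return policy_loss - entropy_coef * entropy_bonus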
