A3C

A3C, short for Asynchronous Advantage Actor-Critic, is a policy-based, actor-critic reinforcement learning algorithm that combines multiple parallel actor-learner processes with a shared, global network. Introduced by DeepMind researchers in 2016, it was designed to stabilize and speed up the training of deep policies by decorrelating the data collected by the parallel agents.

In A3C, several worker agents run in parallel in separate instances of the environment. Each worker maintains its own copy of the neural networks, including a policy network π(a|s; θ) and a value network V(s; θv). Workers interact with their environments, compute n-step returns and advantages, and asynchronously push gradients to the global network. Periodically, workers synchronize their local parameters with the global parameters. The asynchronous updates reduce the temporal correlations that hinder learning and make efficient use of multi-core CPUs.
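
The following is a minimal sketch of one worker's update cycle, assuming a small PyTorch actor-critic module standing in for the policy and value networks, a Gymnasium-style environment interface, and placeholder hyperparameters; the names ActorCritic and worker are illustrative, not the original DeepMind implementation.

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Small shared-body network with a policy head and a value head."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # logits for pi(a|s; theta)
        self.value_head = nn.Linear(hidden, 1)           # V(s; theta_v)

    def forward(self, obs):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def worker(global_net, optimizer, env, n_steps=5, gamma=0.99):
    """One actor-learner: sync with the global net, roll out n steps, push gradients.

    `optimizer` is assumed to be a shared optimizer built over global_net.parameters().
    """
    local_net = ActorCritic(env.observation_space.shape[0], env.action_space.n)
    obs, _ = env.reset()
    while True:
        # Synchronize local parameters with the global parameters.
        local_net.load_state_dict(global_net.state_dict())

        log_probs, values, entropies, rewards = [], [], [], []
        done = False
        for _ in range(n_steps):
            logits, value = local_net(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            obs, reward, terminated, truncated, _ = env.step(action.item())
            log_probs.append(dist.log_prob(action))
            values.append(value)
            entropies.append(dist.entropy())
            rewards.append(float(reward))
            done = terminated or truncated
            if done:
                obs, _ = env.reset()
                break

        # Bootstrap value for the n-step return: 0 at terminal states, V(s_{t+n}) otherwise.
        with torch.no_grad():
            R = torch.tensor(0.0) if done else local_net(
                torch.as_tensor(obs, dtype=torch.float32))[1]

        # n-step returns, advantages, and the combined loss
        # (spelled out term by term in the sketch after the next paragraph).
        returns = []
        for r in reversed(rewards):
            R = r + gamma * R
            returns.append(R)
        returns = torch.stack(list(reversed(returns)))
        advantages = returns - torch.stack(values)
        loss = (-(torch.stack(log_probs) * advantages.detach()).sum()
                + 0.5 * advantages.pow(2).sum()
                - 0.01 * torch.stack(entropies).sum())

        # Compute gradients locally, then push them to the global network.
        local_net.zero_grad()
        loss.backward()
        for lp, gp in zip(local_net.parameters(), global_net.parameters()):
            gp.grad = lp.grad
        optimizer.step()

In practice each worker would run in its own process (for example via torch.multiprocessing), with global_net.share_memory() called beforehand and a single optimizer constructed over global_net.parameters(), so that every worker writes into the same shared tensors.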

The learning objective combines three components: a policy gradient term using the advantage A(s, a) = Rt − V(s; θv), a value loss that minimizes the squared error between the observed return Rt and the value estimate, and an entropy bonus to encourage exploration. Returns are typically computed with n-step bootstrapping, balancing bias and variance. A3C is on-policy, as updates are based on data generated by the current policy, and uses shared parameterization to learn both the policy and the value function.
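
Expanding the loss line from the worker sketch above, the function below spells out the three terms for one n-step rollout; the value coefficient of 0.5 and entropy coefficient of 0.01 are common defaults used here for illustration rather than prescribed values, and a3c_loss is a hypothetical helper name.

import torch

def a3c_loss(log_probs, values, entropies, rewards, bootstrap_value,
             gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """Three-term A3C objective for one n-step rollout.

    log_probs, values, entropies: lists of scalar tensors, one per step.
    rewards: list of floats.
    bootstrap_value: V(s_{t+n}; theta_v), or 0.0 if the rollout ended at a terminal state.
    """
    # n-step bootstrapped returns: R_t = r_t + gamma * R_{t+1}, seeded with the bootstrap value.
    R = torch.as_tensor(bootstrap_value, dtype=torch.float32)
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns = torch.stack(list(reversed(returns))).detach()

    values = torch.stack(values)
    log_probs = torch.stack(log_probs)
    entropies = torch.stack(entropies)

    # Advantage estimate A(s_t, a_t) = R_t - V(s_t; theta_v).
    advantages = returns - values

    policy_loss = -(log_probs * advantages.detach()).sum()  # policy gradient term
    value_loss = advantages.pow(2).sum()                     # squared error between R_t and V(s_t)
    entropy_bonus = entropies.sum()                          # encourages exploration

    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus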

A3C is applicable to a range of tasks, notably discrete-action environments such as Atari games, and often employs convolutional networks for processing visual input. It does not rely on a replay buffer, which distinguishes it from off-policy methods. A synchronous variant, A2C, simplifies the approach by removing asynchrony, while subsequent methods have built on its ideas to create more scalable distributed RL algorithms.