
On-policy

On-policy reinforcement learning refers to a class of methods in which the agent learns about and improves the policy it is currently following. The data used to update the policy are generated by executing that same policy, so the learning process is tightly coupled to the behavior policy being updated.

In on-policy methods, policy evaluation and policy improvement are carried out using trajectories produced by the current policy. This means that the agent cannot easily reuse data collected under a different policy without corrections, making the approach more straightforward theoretically but often less data-efficient in practice. Common on-policy algorithms include SARSA, certain policy gradient methods such as REINFORCE, and modern actor-critic approaches like A2C and PPO in their standard on-policy forms.
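
A minimal sketch of tabular SARSA illustrates the on-policy structure: the action used in the bootstrap target is drawn from the same epsilon-greedy policy that generates behavior. The Gymnasium-style environment API, `n_states`, and `n_actions` are assumptions for illustration, not taken from the text above.

    import numpy as np

    def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
        # Behavior policy: mostly greedy with respect to Q, occasionally random.
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[state]))

    def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99):
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            state, _ = env.reset()
            action = epsilon_greedy(Q, state, n_actions)
            done = False
            while not done:
                next_state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                # The next action comes from the *same* policy being learned,
                # so the TD target reflects the behavior policy: on-policy.
                next_action = epsilon_greedy(Q, next_state, n_actions)
                td_target = reward + gamma * Q[next_state, next_action] * (not done)
                Q[state, action] += alpha * (td_target - Q[state, action])
                state, action = next_state, next_action
        return Q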

On-policy methods can offer stability and simpler convergence properties in some environments because the updates are based on data that accurately reflects the policy being optimized. However, they typically require fresh data for each update and can be sensitive to exploration strategies, which can limit sample efficiency and slow learning in complex tasks.
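
The fresh-data requirement is visible in a rough REINFORCE sketch: each gradient step uses only a trajectory sampled from the current policy, and that trajectory becomes stale as soon as the parameters change. PyTorch, the `policy` network, and the Gymnasium-style environment are assumptions made for this sketch.

    import torch

    def reinforce_update(policy, optimizer, env, gamma=0.99):
        # Collect one trajectory with the current policy.
        log_probs, rewards = [], []
        obs, _ = env.reset()
        done = False
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        # Discounted return for each time step.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.tensor(returns)
        # Policy gradient step on this trajectory only.
        loss = -(torch.stack(log_probs) * returns).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # The trajectory is now stale; the next update must sample a new one
        # from the updated policy.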

By contrast, off-policy methods learn about a policy using data gathered from potentially different policies, enabling re-use of past experience and greater data efficiency but often at the cost of increased algorithmic and statistical complexity. Examples of off-policy approaches include Q-learning, DQN, and certain deep deterministic policy gradient variants.
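
For contrast, a sketch of the tabular Q-learning update shows why it is off-policy: the bootstrap target uses the greedy action (the max over Q), not the action the behavior policy will actually take, so experience generated by any policy can be reused. Variable names mirror the SARSA sketch above and are purely illustrative.

    import numpy as np

    def q_learning_step(Q, state, action, reward, next_state, done,
                        alpha=0.1, gamma=0.99):
        # Off-policy target: greedy value of the next state, independent of
        # which action the behavior policy actually selects there.
        td_target = reward + gamma * np.max(Q[next_state]) * (not done)
        Q[state, action] += alpha * (td_target - Q[state, action])
        return Q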

In practice, the choice between on-policy and off-policy approaches depends on the task, data availability, and computational considerations. On-policy methods are commonly used when stable, reliable learning from current behavior is prioritized.
