
Temporal difference

Temporal difference, typically called temporal-difference (TD) learning, is a family of model-free reinforcement learning methods used to estimate value functions and guide policy decisions from ongoing interaction with an environment. TD methods combine ideas from Monte Carlo methods and dynamic programming by bootstrapping: they update value estimates using other learned estimates rather than waiting for complete episode outcomes. This makes TD methods well suited to online learning and to continuing tasks.

In its simplest form, TD updates adjust the value of a state or state–action pair after each step. For a state-value function V, the TD(0) update is

V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)],

where α is the learning rate and γ is the discount factor. For an action-value function Q, the update is

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)].
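
As a minimal sketch of these two updates, the tabular Python functions below apply the TD(0) rule to a state-value array V and the SARSA-style rule to an action-value array Q. The function names, array sizes, and the sample transition are illustrative assumptions for this example, not part of any particular library.

```python
import numpy as np

def td0_value_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to a tabular state-value array V."""
    td_error = r + gamma * V[s_next] - V[s]   # R_{t+1} + gamma*V(S_{t+1}) - V(S_t)
    V[s] += alpha * td_error
    return td_error

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Apply one SARSA (on-policy TD) update to a tabular action-value array Q."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical usage with 5 states and 2 actions (sizes chosen only for illustration).
V = np.zeros(5)
Q = np.zeros((5, 2))
td0_value_update(V, s=0, r=1.0, s_next=1)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
```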

Variants of TD include TD(λ), which uses eligibility traces to blend observed rewards over multiple past steps and encompasses TD(0) as a special case (λ = 0). Notable TD-based algorithms include Q-learning, an off-policy method that targets the optimal action-value function, and SARSA, an on-policy method that updates using the actual action taken by the current policy.
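
To make the eligibility-trace idea concrete, here is a hedged sketch of tabular TD(λ) for state values with accumulating traces, plus a Q-learning update for comparison with the SARSA update above (Q-learning bootstraps from the greedy action rather than the action actually taken). The transition format and parameter values are assumptions made for this example.

```python
import numpy as np

def td_lambda_episode(V, transitions, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) with accumulating eligibility traces.

    transitions: list of (s, r, s_next, done) tuples from one episode
    (an assumed format for this sketch). With lam=0 this reduces to TD(0).
    """
    z = np.zeros_like(V)                      # eligibility trace per state
    for s, r, s_next, done in transitions:
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]                 # TD error
        z[s] += 1.0                           # accumulate trace for the visited state
        V += alpha * delta * z                # credit all recently visited states
        z *= gamma * lam                      # decay traces toward zero
    return V

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy Q-learning: bootstrap from the greedy (max) next action."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```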

Convergence properties vary by setting. In tabular problems or with certain function approximators and policies, TD methods can converge to correct values under appropriate conditions. When using function approximation, especially with nonlinear models, convergence can be more delicate and may require additional techniques.
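
As one illustration of the function-approximation setting, the sketch below applies semi-gradient TD(0) with a linear value estimate. The feature function phi and the parameter values are assumptions introduced for this example only.

```python
import numpy as np

def semi_gradient_td0(w, phi, s, r, s_next, done, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) with a linear value estimate v(s) = w . phi(s).

    phi: assumed feature function mapping a state to a fixed-length numpy vector.
    """
    v_next = 0.0 if done else w @ phi(s_next)
    delta = r + gamma * v_next - w @ phi(s)   # TD error under the current weights
    w += alpha * delta * phi(s)               # gradient of v(s) with respect to w is phi(s)
    return w
```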

Applications of temporal-difference learning span robotics, game playing, autonomous control, and other domains where agents learn from continuous experience without requiring a model of the environment. It remains a central concept in reinforcement learning research and practice.
