
Temporal difference

Temporal difference, typically called temporal-difference (TD) learning, is a family of model-free reinforcement learning methods used to estimate value functions and guide policy decisions from ongoing interaction with an environment. TD methods combine ideas from Monte Carlo methods and dynamic programming by bootstrapping: they update value estimates using other learned estimates rather than waiting for complete episode outcomes. This makes TD methods well suited to online learning and to continuing tasks.

In its simplest form, TD updates adjust the value of a state or state–action pair after each step. For a state-value function V, the TD(0) update is

V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)],

where α is the learning rate and γ is the discount factor. For an action-value function Q, the update is

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)].
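
As a minimal sketch of these two updates, the tabular Python functions below apply the TD(0) rule to a state-value array V and the SARSA-style rule to an action-value array Q. The function names, array sizes, and the sample transition are illustrative assumptions for this example, not part of any particular library.

```python
import numpy as np

def td0_value_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to a tabular state-value array V."""
    td_error = r + gamma * V[s_next] - V[s]   # R_{t+1} + gamma*V(S_{t+1}) - V(S_t)
    V[s] += alpha * td_error
    return td_error

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Apply one SARSA (on-policy TD) update to a tabular action-value array Q."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical usage with 5 states and 2 actions (sizes chosen only for illustration).
V = np.zeros(5)
Q = np.zeros((5, 2))
td0_value_update(V, s=0, r=1.0, s_next=1)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
```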

Variants of TD include TD(λ), which uses eligibility traces to blend observed rewards over multiple past steps and encompasses TD(0) as a special case (λ = 0). Notable TD-based algorithms include Q-learning, an off-policy method that targets the optimal action-value function, and SARSA, an on-policy method that updates using the actual action taken by the current policy.
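
To make the eligibility-trace idea concrete, here is a hedged sketch of tabular TD(λ) for state values with accumulating traces, plus a Q-learning update for comparison with the SARSA update above (Q-learning bootstraps from the greedy action rather than the action actually taken). The transition format and parameter values are assumptions made for this example.

```python
import numpy as np

def td_lambda_episode(V, transitions, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) with accumulating eligibility traces.

    transitions: list of (s, r, s_next, done) tuples from one episode
    (an assumed format for this sketch). With lam=0 this reduces to TD(0).
    """
    z = np.zeros_like(V)                      # eligibility trace per state
    for s, r, s_next, done in transitions:
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]                 # TD error
        z[s] += 1.0                           # accumulate trace for the visited state
        V += alpha * delta * z                # credit all recently visited states
        z *= gamma * lam                      # decay traces toward zero
    return V

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy Q-learning: bootstrap from the greedy (max) next action."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```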

Convergence properties vary by setting. In tabular problems or with certain function approximators and policies, TD methods can converge to correct values under appropriate conditions. When using function approximation, especially with nonlinear models, convergence can be more delicate and may require additional techniques.
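
As one illustration of the function-approximation setting, the sketch below applies semi-gradient TD(0) with a linear value estimate. The feature function phi and the parameter values are assumptions introduced for this example only.

```python
import numpy as np

def semi_gradient_td0(w, phi, s, r, s_next, done, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) with a linear value estimate v(s) = w . phi(s).

    phi: assumed feature function mapping a state to a fixed-length numpy vector.
    """
    v_next = 0.0 if done else w @ phi(s_next)
    delta = r + gamma * v_next - w @ phi(s)   # TD error under the current weights
    w += alpha * delta * phi(s)               # gradient of v(s) with respect to w is phi(s)
    return w
```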

Applications of temporal-difference learning span robotics, game playing, autonomous control, and other domains where agents learn from continuous experience without requiring a model of the environment. It remains a central concept in reinforcement learning research and practice.
