tAUCB

tAUCB, short for temporal Adaptive Upper Confidence Bound, is a class of sequential decision-making algorithms designed for multi-armed bandit problems in non-stationary environments. It generalizes the classic UCB approach by incorporating time-awareness to handle reward distributions that change over time.

In tAUCB, each arm maintains a time-weighted estimate of its expected reward. This is achieved using either a sliding window of recent observations or an exponential forgetting mechanism to downweight older data. The exploration bonus, or confidence term, reflects the effective sample size under the chosen weighting scheme and may also incorporate change-detection components. At each step, the algorithm selects the arm with the highest upper confidence bound, balancing currently observed rewards against the uncertainty caused by potential changes in the environment.

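How these pieces fit together is easiest to see in code. The sketch below is a minimal, hypothetical implementation of the exponential-forgetting variant, in the style of discounted UCB; the names DiscountedArm and select_arm, the discount factor gamma, and the exploration constant c are illustrative choices for this example, not part of any canonical tAUCB specification.

```python
import math
import random

class DiscountedArm:
    """Discounted (exponentially forgotten) statistics for one arm."""

    def __init__(self):
        self.weighted_reward = 0.0  # reward sum, decayed by gamma per step of age
        self.weighted_count = 0.0   # effective sample size under the same decay

    def decay(self, gamma):
        """Apply one step of forgetting to all past observations."""
        self.weighted_reward *= gamma
        self.weighted_count *= gamma

    def update(self, reward):
        """Record a fresh observation with full weight."""
        self.weighted_reward += reward
        self.weighted_count += 1.0

def select_arm(arms, c=2.0):
    """Return the index of the arm with the highest discounted UCB index."""
    effective_total = sum(a.weighted_count for a in arms)
    best_i, best_index = 0, -math.inf
    for i, arm in enumerate(arms):
        if arm.weighted_count == 0.0:
            return i  # play every arm once before trusting the indices
        mean = arm.weighted_reward / arm.weighted_count
        # The bonus uses the effective sample size, so heavy forgetting
        # (small effective counts) keeps exploration alive; the max() guards
        # against taking the log of a value below 1 under strong discounting.
        bonus = math.sqrt(c * math.log(max(effective_total, 2.0)) / arm.weighted_count)
        if mean + bonus > best_index:
            best_i, best_index = i, mean + bonus
    return best_i

# Toy run with Bernoulli rewards and an abrupt change halfway through.
gamma = 0.98  # forgetting factor; values closer to 1 forget more slowly
true_means = [0.2, 0.5, 0.8]
arms = [DiscountedArm() for _ in true_means]
for t in range(2000):
    if t == 1000:
        true_means.reverse()  # the previously best arm becomes the worst
    i = select_arm(arms)
    reward = 1.0 if random.random() < true_means[i] else 0.0
    for a in arms:
        a.decay(gamma)  # age all past data before recording the new sample
    arms[i].update(reward)
```

With gamma = 0.98 the effective sample size is capped near 1 / (1 - gamma) = 50, so the confidence term never fully vanishes and the algorithm keeps probing arms whose rewards may have shifted.
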
Variants of tAUCB differ in how they implement weighting, the exact form of the confidence term, and how they respond to detected changes (for example, by resetting estimates or adapting parameters). Theoretical analyses often focus on dynamic or tracking regret, aiming for sublinear regret as a function of time under assumptions about the rate of environment change, such as a bounded number of change points or smooth time variation.
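
As one concrete (and deliberately simple) illustration of the change-response behavior described above, the hypothetical helper below compares a short recent window of an arm's rewards against its older data and signals a reset when the two disagree by more than a threshold; real change detectors such as CUSUM or Page-Hinkley tests are more principled.

```python
from collections import deque

def maybe_reset(history: deque, recent: int = 20, threshold: float = 0.3) -> bool:
    """Reset signal: True when the recent mean drifts far from the older mean.

    `history` holds one arm's observed rewards, newest last. The window
    length and threshold are illustrative tuning knobs, not standard values.
    """
    if len(history) < 2 * recent:
        return False  # not enough data to compare two windows
    rewards = list(history)
    old_mean = sum(rewards[:-recent]) / (len(rewards) - recent)
    new_mean = sum(rewards[-recent:]) / recent
    return abs(new_mean - old_mean) > threshold
```

A variant that responds to change by resetting estimates would call such a test per arm after each update and, on a True result, clear that arm's history and weighted statistics.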

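The dynamic-regret objective these analyses target can be stated precisely; in the standard notation below (not specific to tAUCB), \mu_{a,t} is the expected reward of arm a at step t and A_t is the arm played:

```latex
% Dynamic (tracking) regret compares against the per-step best arm,
% which may change whenever the environment does.
R_T \;=\; \sum_{t=1}^{T} \Bigl( \max_{a} \mu_{a,t} \;-\; \mu_{A_t,\,t} \Bigr)
```

Sublinear here means R_T / T tends to 0; with at most C_T abrupt change points, bounds of order \sqrt{C_T T \log T} are the typical target for sliding-window and discounted variants.
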
Applications of tAUCB arise in settings where reward distributions evolve, including online advertising, recommender systems, and autonomous control. Limitations include sensitivity to hyperparameters like window size or forgetting factors, and potential lag in adaptation to rapid changes. Compared with stationary UCB, tAUCB prioritizes responsiveness to change, sometimes at the expense of sample efficiency.

See also UCB, discounted UCB, sliding-window UCB, and non-stationary bandits.