
Multi-Armed Bandit Methods

The multi-armed bandit problem (MAB) is a framework for sequential decision making under uncertainty. In the canonical stochastic setting, there are K arms (options). At each round t, the learner selects an arm a_t and receives a reward r_t drawn from an unknown distribution associated with that arm. The reward distributions are assumed stationary and independent across rounds. The goal is to maximize cumulative reward over a time horizon T, or equivalently to minimize regret.
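As a concrete illustration of this setting, a minimal simulation sketch might look as follows. The Bernoulli reward model and the specific arm probabilities are illustrative assumptions, not part of the formal definition:

```python
import random

class BernoulliBandit:
    """Stochastic K-armed bandit with Bernoulli rewards (illustrative assumption)."""

    def __init__(self, means):
        self.means = means  # true mean rewards, unknown to the learner
        self.k = len(means)

    def pull(self, arm):
        """Draw a 0/1 reward from the chosen arm's distribution."""
        return 1 if random.random() < self.means[arm] else 0

# Example with three arms; the probabilities 0.2, 0.5, 0.7 are made up for illustration.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
reward = bandit.pull(2)
```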

Let μ_k denote the mean reward of arm k, and μ* = max_k μ_k the best mean. The cumulative regret after T rounds is R_T = T μ* − E[∑_{t=1}^T r_t]. In stochastic MAB, regret typically grows sublinearly in T with appropriate algorithms, allowing the average regret per round to vanish.
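A rough way to estimate this quantity in simulation, reusing the BernoulliBandit sketch above; `policy` is a hypothetical callable mapping the round index to an arm:

```python
def cumulative_regret(bandit, policy, horizon):
    """Estimate R_T = T*mu_star - sum of rewards from a single run.

    Averaging this over many independent runs approximates the expectation
    in the regret definition.
    """
    best_mean = max(bandit.means)
    total_reward = 0
    for t in range(1, horizon + 1):
        arm = policy(t)
        total_reward += bandit.pull(arm)
    return horizon * best_mean - total_reward

# Example: always pulling arm 0 (a suboptimal arm) gives regret growing linearly in T.
print(cumulative_regret(bandit, lambda t: 0, horizon=1000))
```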

Common algorithms include: epsilon-greedy, which explores randomly with probability ε and exploits the best-so-far arm otherwise; Upper Confidence Bound (UCB) methods such as UCB1, which select the arm with the largest upper confidence bound on its mean; and Thompson sampling (Bayesian bandits), which samples arm means from posterior distributions and acts on the sample.
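The three selection rules can be sketched roughly as follows. The per-arm bookkeeping (`means_hat`, `counts`, `successes`, `failures`) is assumed to be maintained by the caller, and Bernoulli rewards with Beta posteriors are an assumption made for the Thompson sampling example:

```python
import math
import random

def epsilon_greedy(means_hat, epsilon=0.1):
    """Explore a uniformly random arm with prob. epsilon, else exploit the best-so-far arm."""
    if random.random() < epsilon:
        return random.randrange(len(means_hat))
    return max(range(len(means_hat)), key=lambda a: means_hat[a])

def ucb1(means_hat, counts, t):
    """Pick the arm with the largest upper confidence bound on its empirical mean."""
    for a, n in enumerate(counts):
        if n == 0:  # play each arm once before the bound is defined
            return a
    return max(range(len(means_hat)),
               key=lambda a: means_hat[a] + math.sqrt(2 * math.log(t) / counts[a]))

def thompson_bernoulli(successes, failures):
    """Sample a mean for each arm from its Beta posterior and act on the sample."""
    samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])
```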

Variants exist: contextual bandits incorporate side information (contexts) that affect reward distributions; non-stationary bandits allow arm means to drift; adversarial bandits relax the stochastic assumptions; and combinatorial or finite-horizon variants address different action spaces and time scales.
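For the non-stationary case, one simple idea, sketched here under the assumption of exponentially weighted averaging (other approaches such as sliding windows exist), is to down-weight old observations so the estimate can follow a drifting mean:

```python
def discounted_mean(estimate, reward, gamma=0.99):
    """Exponentially weighted mean update: older rewards are geometrically down-weighted."""
    return gamma * estimate + (1 - gamma) * reward
```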

Applications span online advertising and recommendation systems, A/B testing, clinical trial design, and adaptive routing. The MAB framework provides principled tools for balancing exploration and exploitation in settings where feedback is partial and delayed.
