
Multi-Armed Bandit Methods

The multi-armed bandit problem (MAB) is a framework for sequential decision making under uncertainty. In the canonical stochastic setting, there are K arms (options). At each round t, the learner selects an arm a_t and receives a reward r_t drawn from an unknown distribution associated with that arm. The reward distributions are assumed stationary and independent across rounds. The goal is to maximize cumulative reward over a time horizon T, or equivalently to minimize regret.
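As a concrete illustration of this setting, a minimal simulation sketch might look as follows. The Bernoulli reward model and the specific arm probabilities are illustrative assumptions, not part of the formal definition:

```python
import random

class BernoulliBandit:
    """Stochastic K-armed bandit with Bernoulli rewards (illustrative assumption)."""

    def __init__(self, means):
        self.means = means  # true mean rewards, unknown to the learner
        self.k = len(means)

    def pull(self, arm):
        """Draw a 0/1 reward from the chosen arm's distribution."""
        return 1 if random.random() < self.means[arm] else 0

# Example with three arms; the probabilities 0.2, 0.5, 0.7 are made up for illustration.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
reward = bandit.pull(2)
```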

Let μ_k denote the mean reward of arm k, and μ* = max_k μ_k the best mean. The cumulative regret after T rounds is R_T = T μ* − E[∑_{t=1}^T r_t]. In stochastic MAB, regret typically grows sublinearly in T with appropriate algorithms, allowing the average regret per round to vanish.
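A rough way to estimate this quantity in simulation, reusing the BernoulliBandit sketch above; `policy` is a hypothetical callable mapping the round index to an arm:

```python
def cumulative_regret(bandit, policy, horizon):
    """Estimate R_T = T*mu_star - sum of rewards from a single run.

    Averaging this over many independent runs approximates the expectation
    in the regret definition.
    """
    best_mean = max(bandit.means)
    total_reward = 0
    for t in range(1, horizon + 1):
        arm = policy(t)
        total_reward += bandit.pull(arm)
    return horizon * best_mean - total_reward

# Example: always pulling arm 0 (a suboptimal arm) gives regret growing linearly in T.
print(cumulative_regret(bandit, lambda t: 0, horizon=1000))
```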

Common algorithms include: epsilon-greedy, which explores randomly with probability ε and exploits the best-so-far arm otherwise; Upper Confidence Bound (UCB) methods such as UCB1, which select the arm with the largest upper confidence bound on its mean; and Thompson sampling (Bayesian bandits), which samples arm means from posterior distributions and acts on the sample.
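The three selection rules can be sketched roughly as follows. The per-arm bookkeeping (`means_hat`, `counts`, `successes`, `failures`) is assumed to be maintained by the caller, and Bernoulli rewards with Beta posteriors are an assumption made for the Thompson sampling example:

```python
import math
import random

def epsilon_greedy(means_hat, epsilon=0.1):
    """Explore a uniformly random arm with prob. epsilon, else exploit the best-so-far arm."""
    if random.random() < epsilon:
        return random.randrange(len(means_hat))
    return max(range(len(means_hat)), key=lambda a: means_hat[a])

def ucb1(means_hat, counts, t):
    """Pick the arm with the largest upper confidence bound on its empirical mean."""
    for a, n in enumerate(counts):
        if n == 0:  # play each arm once before the bound is defined
            return a
    return max(range(len(means_hat)),
               key=lambda a: means_hat[a] + math.sqrt(2 * math.log(t) / counts[a]))

def thompson_bernoulli(successes, failures):
    """Sample a mean for each arm from its Beta posterior and act on the sample."""
    samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])
```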

Variants exist: contextual bandits incorporate side information (contexts) that affect reward distributions; non-stationary bandits allow arm means to drift; adversarial bandits relax the stochastic assumptions; and combinatorial or finite-horizon variants address different action spaces and time scales.
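For the non-stationary case, one simple idea, sketched here under the assumption of exponentially weighted averaging (other approaches such as sliding windows exist), is to down-weight old observations so the estimate can follow a drifting mean:

```python
def discounted_mean(estimate, reward, gamma=0.99):
    """Exponentially weighted mean update: older rewards are geometrically down-weighted."""
    return gamma * estimate + (1 - gamma) * reward
```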

Applications span online advertising and recommendation systems, A/B testing, clinical trial design, and adaptive routing. The MAB framework provides principled tools for balancing exploration and exploitation in settings where feedback is partial and delayed.
