
MDP

An MDP, or Markov decision process, is a mathematical framework for modeling decision making in stochastic environments. It represents a decision-maker interacting with an environment over a sequence of time steps, where the outcome depends on both the current state and the action chosen. The model is defined by a five-tuple (S, A, P, R, γ): S is a set of states, A is a set of actions, P(s'|s,a) is the probability of transitioning to state s' when taking action a in state s, R(s,a) is the immediate reward received, and γ ∈ [0,1) is a discount factor that weights immediate rewards more heavily than future ones.

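As a concrete illustration, the five-tuple can be written out directly for a toy problem. The sketch below (Python) uses invented states, actions, probabilities, and rewards, so all of its specifics are assumptions for illustration only.

```python
# A minimal sketch of the (S, A, P, R, gamma) tuple for a toy two-state MDP.
# Every state, action, and number here is made up for illustration.

S = ["s0", "s1"]          # state set S
A = ["stay", "move"]      # action set A
gamma = 0.9               # discount factor in [0, 1)

# P[s][a] maps each next state s' to its transition probability P(s'|s, a).
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.7, "s1": 0.3}},
}

# R[s][a] is the immediate reward R(s, a).
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}
```
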
A policy π specifies the action to take in each state. It can be deterministic (π(s) ∈ A) or stochastic (π(a|s) provides a distribution over actions). The goal is to find an optimal policy π* that maximizes expected cumulative reward, typically defined as E[Σ_t γ^t R(S_t, A_t)]. Value functions V^π(s) and Q^π(s,a) are used to quantify expected returns under a policy. The Bellman equations relate these values: for a given policy, V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [R(s,a) + γ V^π(s')], and the Bellman optimality equation is V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a) + γ V*(s')].

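Both Bellman equations translate directly into iterative backups. The sketch below is a minimal Python illustration, reusing the dictionary layout of the toy MDP above (P[s][a][s'] and R[s][a]); the function names and parameters are invented for this sketch, not a standard API.

```python
# A minimal sketch of both Bellman backups, assuming P[s][a][s'] and R[s][a]
# dictionaries like the toy MDP above; names and defaults are illustrative.

def evaluate_policy(S, A, P, R, gamma, pi, sweeps=1000, tol=1e-8):
    """Bellman expectation backup: compute V^pi for a stochastic policy
    pi[s][a] = π(a|s) by repeated sweeps over the states."""
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        delta = 0.0
        for s in S:
            v_new = sum(
                pi[s][a] * sum(p * (R[s][a] + gamma * V[s2])
                               for s2, p in P[s][a].items())
                for a in A
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:   # stop once a full sweep changes no value noticeably
            break
    return V


def value_iteration(S, A, P, R, gamma, sweeps=1000, tol=1e-8):
    """Bellman optimality backup: approximate V* by repeated sweeps."""
    V = {s: 0.0 for s in S}
    for _ in range(sweeps):
        delta = 0.0
        for s in S:
            v_new = max(
                sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a].items())
                for a in A
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    return V
```

Because γ < 1, each backup is a contraction, so the sweeps converge to V^π and V* respectively.
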
Algorithms for solving MDPs include dynamic programming methods such as value iteration and policy iteration when the model P and R are known, and model-free methods such as Q-learning, SARSA, and various policy-gradient or actor-critic methods when the model is unknown.

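As one model-free example, the sketch below outlines tabular Q-learning under an ε-greedy behaviour policy. The environment sampler env_step, the start state, and all hyperparameters are hypothetical placeholders, since the text does not specify an environment.

```python
import random

# A minimal sketch of tabular Q-learning. `env_step(s, a)` is a hypothetical
# function returning (next_state, reward, done); it stands in for the unknown
# model P and R, and every default value below is illustrative.

def q_learning(S, A, env_step, start_state, gamma=0.9, alpha=0.1,
               epsilon=0.1, episodes=500, max_steps=100):
    Q = {s: {a: 0.0 for a in A} for s in S}
    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            # ε-greedy behaviour policy
            if random.random() < epsilon:
                a = random.choice(A)
            else:
                a = max(Q[s], key=Q[s].get)
            s2, r, done = env_step(s, a)  # sample a transition from the environment
            # move Q(s,a) toward the bootstrapped target r + γ max_a' Q(s',a')
            target = r if done else r + gamma * max(Q[s2].values())
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if done:
                break
    return Q
```
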
In continuous state or action spaces, function approximation is often used. Extensions include partially observable MDPs (POMDPs), which handle hidden state information.

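As a sketch of what function approximation can look like in the simplest case, the snippet below applies a semi-gradient TD(0) update to a linear value estimate V(s) ≈ w · φ(s). The feature map phi and the step size are assumptions, and semi-gradient TD(0) is just one common choice, not something prescribed by the text.

```python
import numpy as np

# A small sketch of linear function approximation for prediction:
# V(s) ≈ w @ phi(s), trained by semi-gradient TD(0). The feature map `phi`
# (returning a NumPy vector) is a hypothetical choice for illustration.

def td0_linear_update(w, phi, s, r, s2, gamma=0.9, alpha=0.01, done=False):
    """One semi-gradient TD(0) step on the weights w of V(s) ≈ w @ phi(s)."""
    v_s = float(w @ phi(s))
    v_s2 = 0.0 if done else float(w @ phi(s2))
    td_error = r + gamma * v_s2 - v_s
    return w + alpha * td_error * phi(s)  # gradient of w @ phi(s) w.r.t. w is phi(s)
```
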
MDPs underlie many applications in robotics, control, economics, operations research, and reinforcement learning.