POMDPs

POMDP stands for partially observable Markov decision process. It generalizes the classical MDP to settings where the agent does not directly observe the underlying state. A POMDP is defined by a finite set of states S, a set of actions A, a set of possible observations O, a state transition model T(s'|s,a) that gives the probability of moving to state s' from s after action a, an observation model Ω(o|s',a) giving the probability of observing o after reaching s' when taking a, and a reward function R(s,a) or R(s,a,s'). The agent starts with an initial belief b0, a probability distribution over states, and the goal is to select actions to maximize expected cumulative reward over time despite partial observability.
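
As a concrete illustration, a small POMDP can be written down directly as arrays. The sketch below encodes the classic tiger problem; the variable names (S, A, O, T, Z, R, b0) are just one possible layout chosen for this page, not a reference to any particular library.

```python
# A minimal sketch of a POMDP specification as plain arrays, using the
# classic tiger problem for concreteness. All names are illustrative.
import numpy as np

S = ["tiger-left", "tiger-right"]          # states
A = ["listen", "open-left", "open-right"]  # actions
O = ["hear-left", "hear-right"]            # observations

# T[a, s, s'] = T(s'|s,a): listening leaves the state unchanged,
# opening a door resets the problem to a uniformly random state.
T = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # listen
    [[0.5, 0.5], [0.5, 0.5]],   # open-left
    [[0.5, 0.5], [0.5, 0.5]],   # open-right
])

# Z[a, s', o] = Ω(o|s',a): listening is 85% accurate, opening a door
# yields an uninformative observation.
Z = np.array([
    [[0.85, 0.15], [0.15, 0.85]],  # listen
    [[0.5, 0.5], [0.5, 0.5]],      # open-left
    [[0.5, 0.5], [0.5, 0.5]],      # open-right
])

# R[a, s] = R(s,a): listening costs 1, opening the tiger's door costs 100,
# opening the other door is worth 10.
R = np.array([
    [-1.0,   -1.0],    # listen
    [-100.0, 10.0],    # open-left
    [10.0, -100.0],    # open-right
])

b0 = np.array([0.5, 0.5])  # initial belief: tiger equally likely on either side
```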

In a POMDP, the agent maintains a belief state b, a probability distribution over states, and updates it using Bayes' rule after each action-observation pair. After taking action a and receiving observation o, the updated belief b' assigns to each state s' a probability proportional to Ω(o|s',a) ∑_s T(s'|s,a) b(s); the normalization constant ensures that the belief sums to one. Given the policy and the observations received, the belief state evolves deterministically.
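
Continuing the array-based sketch above (and reusing T, Z, and b0 from it), this update can be written as a short function; belief_update is a hypothetical helper name, not a library call.

```python
def belief_update(b, a, o, T, Z):
    """Bayes' rule: b'(s') ∝ Ω(o|s',a) * sum_s T(s'|s,a) * b(s)."""
    predicted = T[a].T @ b             # sum_s T(s'|s,a) b(s), indexed by s'
    unnormalized = Z[a][:, o] * predicted
    norm = unnormalized.sum()          # P(o|b,a), the normalization constant
    if norm == 0.0:
        raise ValueError("Observation has zero probability under this belief.")
    return unnormalized / norm

# Example: starting from b0 and listening (a=0), hearing the tiger on the
# left (o=0) shifts the belief toward tiger-left.
b1 = belief_update(b0, a=0, o=0, T=T, Z=Z)   # -> array([0.85, 0.15])
```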

A policy in a POMDP maps belief states to actions. The objective is to maximize the expected discounted sum of rewards starting from b0. The POMDP value function V(b) gives the optimal expected return from belief b and satisfies a Bellman-like equation: V(b) = max_a [ R(b,a) + γ ∑_o P(o|b,a) V(b') ], where R(b,a) = ∑_s b(s) R(s,a) and b' is the updated belief after observing o.
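
The right-hand side of this equation is straightforward to evaluate given the belief update above; bellman_backup below is a hypothetical helper that applies one backup at a belief b for an arbitrary value estimate V, again reusing the tiger-problem arrays.

```python
def bellman_backup(b, V, T, Z, R, gamma=0.95):
    """One application of the Bellman-like equation at belief b:
    max_a [ R(b,a) + gamma * sum_o P(o|b,a) * V(b') ].
    V is any callable that estimates the value of a belief."""
    n_actions, n_obs = T.shape[0], Z.shape[2]
    best = -float("inf")
    for a in range(n_actions):
        q = float(b @ R[a])                         # R(b,a) = sum_s b(s) R(s,a)
        for o in range(n_obs):
            p_o = float(Z[a][:, o] @ (T[a].T @ b))  # P(o|b,a)
            if p_o > 0.0:
                q += gamma * p_o * V(belief_update(b, a, o, T, Z))
        best = max(best, q)
    return best

# Example: with a zero value estimate, one backup from b0 reduces to the
# best immediate belief reward max_a R(b0, a), which is -1 (listen).
print(bellman_backup(b0, V=lambda b: 0.0, T=T, Z=Z, R=R))
```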

Exact solutions operate in the belief space and are generally intractable for large problems; practical methods rely on approximations such as point-based value iteration, Monte Carlo tree search variants like POMCP, and approaches such as QMDP or belief-space policy search.
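
Of these, QMDP is the simplest to sketch: solve the underlying fully observable MDP by value iteration, then score each action by its belief-weighted Q-values. The function below continues the tiger example and is only an illustration of the idea; because it assumes the state becomes fully observable after one step, it never acts purely to gather information.

```python
def qmdp_policy(b, T, R, gamma=0.95, iters=200):
    """QMDP approximation: value-iterate the underlying MDP, then pick the
    action with the highest belief-weighted Q-value."""
    n_actions, n_states, _ = T.shape
    Q = np.zeros((n_actions, n_states))
    for _ in range(iters):
        V = Q.max(axis=0)          # V_MDP(s) = max_a Q(s,a)
        Q = R + gamma * (T @ V)    # Q(s,a) = R(s,a) + γ Σ_s' T(s'|s,a) V_MDP(s')
    scores = Q @ b                 # Q(b,a) ≈ Σ_s b(s) Q(s,a)
    return int(np.argmax(scores))

# Example: choose an action for the post-listening belief b1 from above.
action = qmdp_policy(b1, T, R)     # index into A
```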

Applications include robotics and planning under uncertainty.

Related topics include hidden Markov models and Bayesian filtering.
