POMDPs

POMDP stands for partially observable Markov decision process. It generalizes the classical MDP to settings where the agent does not directly observe the underlying state. A POMDP is defined by a finite set of states S, a set of actions A, a set of possible observations O, a state transition model T(s'|s,a) that gives the probability of moving to state s' from s after action a, an observation model Ω(o|s',a) giving the probability of observing o after reaching s' when taking a, and a reward function R(s,a) or R(s,a,s'). The agent starts with an initial belief b0, a probability distribution over states, and the goal is to select actions to maximize expected cumulative reward over time despite partial observability.
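
As a concrete illustration, a small POMDP can be written down directly as arrays. The sketch below encodes the classic tiger problem; the variable names (S, A, O, T, Z, R, b0) are just one possible layout chosen for this page, not a reference to any particular library.

```python
# A minimal sketch of a POMDP specification as plain arrays, using the
# classic tiger problem for concreteness. All names are illustrative.
import numpy as np

S = ["tiger-left", "tiger-right"]          # states
A = ["listen", "open-left", "open-right"]  # actions
O = ["hear-left", "hear-right"]            # observations

# T[a, s, s'] = T(s'|s,a): listening leaves the state unchanged,
# opening a door resets the problem to a uniformly random state.
T = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # listen
    [[0.5, 0.5], [0.5, 0.5]],   # open-left
    [[0.5, 0.5], [0.5, 0.5]],   # open-right
])

# Z[a, s', o] = Ω(o|s',a): listening is 85% accurate, opening a door
# yields an uninformative observation.
Z = np.array([
    [[0.85, 0.15], [0.15, 0.85]],  # listen
    [[0.5, 0.5], [0.5, 0.5]],      # open-left
    [[0.5, 0.5], [0.5, 0.5]],      # open-right
])

# R[a, s] = R(s,a): listening costs 1, opening the tiger's door costs 100,
# opening the other door is worth 10.
R = np.array([
    [-1.0,   -1.0],    # listen
    [-100.0, 10.0],    # open-left
    [10.0, -100.0],    # open-right
])

b0 = np.array([0.5, 0.5])  # initial belief: tiger equally likely on either side
```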

In a POMDP, the agent maintains a belief state b, a probability distribution over states, and updates it using Bayes' rule after each action-observation pair. After taking action a and receiving observation o, the updated belief b' assigns to each state s' a probability proportional to Ω(o|s',a) ∑_s T(s'|s,a) b(s); the normalization constant ensures that the belief sums to one. Given the policy and the observations received, the belief state evolves deterministically.
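
Continuing the array-based sketch above (and reusing T, Z, and b0 from it), this update can be written as a short function; belief_update is a hypothetical helper name, not a library call.

```python
def belief_update(b, a, o, T, Z):
    """Bayes' rule: b'(s') ∝ Ω(o|s',a) * sum_s T(s'|s,a) * b(s)."""
    predicted = T[a].T @ b             # sum_s T(s'|s,a) b(s), indexed by s'
    unnormalized = Z[a][:, o] * predicted
    norm = unnormalized.sum()          # P(o|b,a), the normalization constant
    if norm == 0.0:
        raise ValueError("Observation has zero probability under this belief.")
    return unnormalized / norm

# Example: starting from b0 and listening (a=0), hearing the tiger on the
# left (o=0) shifts the belief toward tiger-left.
b1 = belief_update(b0, a=0, o=0, T=T, Z=Z)   # -> array([0.85, 0.15])
```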

A policy in a POMDP maps belief states to actions. The objective is to maximize the expected discounted sum of rewards starting from b0. The POMDP value function V(b) gives the optimal expected return from belief b and satisfies a Bellman-like equation: V(b) = max_a [ R(b,a) + γ ∑_o P(o|b,a) V(b') ], where R(b,a) = ∑_s b(s) R(s,a) and b' is the updated belief after observing o.
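
The right-hand side of this equation is straightforward to evaluate given the belief update above; bellman_backup below is a hypothetical helper that applies one backup at a belief b for an arbitrary value estimate V, again reusing the tiger-problem arrays.

```python
def bellman_backup(b, V, T, Z, R, gamma=0.95):
    """One application of the Bellman-like equation at belief b:
    max_a [ R(b,a) + gamma * sum_o P(o|b,a) * V(b') ].
    V is any callable that estimates the value of a belief."""
    n_actions, n_obs = T.shape[0], Z.shape[2]
    best = -float("inf")
    for a in range(n_actions):
        q = float(b @ R[a])                         # R(b,a) = sum_s b(s) R(s,a)
        for o in range(n_obs):
            p_o = float(Z[a][:, o] @ (T[a].T @ b))  # P(o|b,a)
            if p_o > 0.0:
                q += gamma * p_o * V(belief_update(b, a, o, T, Z))
        best = max(best, q)
    return best

# Example: with a zero value estimate, one backup from b0 reduces to the
# best immediate belief reward max_a R(b0, a), which is -1 (listen).
print(bellman_backup(b0, V=lambda b: 0.0, T=T, Z=Z, R=R))
```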

Exact solutions operate in the belief space and are generally intractable for large problems; practical methods rely on approximations such as point-based value iteration, Monte Carlo tree search variants like POMCP, and approaches such as QMDP or belief-space policy search.
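
Of these, QMDP is the simplest to sketch: solve the underlying fully observable MDP by value iteration, then score each action by its belief-weighted Q-values. The function below continues the tiger example and is only an illustration of the idea; because it assumes the state becomes fully observable after one step, it never acts purely to gather information.

```python
def qmdp_policy(b, T, R, gamma=0.95, iters=200):
    """QMDP approximation: value-iterate the underlying MDP, then pick the
    action with the highest belief-weighted Q-value."""
    n_actions, n_states, _ = T.shape
    Q = np.zeros((n_actions, n_states))
    for _ in range(iters):
        V = Q.max(axis=0)          # V_MDP(s) = max_a Q(s,a)
        Q = R + gamma * (T @ V)    # Q(s,a) = R(s,a) + γ Σ_s' T(s'|s,a) V_MDP(s')
    scores = Q @ b                 # Q(b,a) ≈ Σ_s b(s) Q(s,a)
    return int(np.argmax(scores))

# Example: choose an action for the post-listening belief b1 from above.
action = qmdp_policy(b1, T, R)     # index into A
```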

Applications include robotics and planning under uncertainty.

Related topics include hidden Markov models and Bayesian filtering.
