MDP
An MDP, or Markov decision process, is a mathematical framework for modeling decision making in stochastic environments. It represents a decision-maker interacting with an environment over a sequence of time steps, where the outcome depends on both the current state and the action chosen. The model is defined by a five-tuple (S, A, P, R, γ): S is a set of states, A is a set of actions, P(s'|s,a) is the probability of transitioning to state s' when taking action a in state s, R(s,a) is the immediate reward received, and γ ∈ [0,1) is a discount factor that weights rewards received sooner more heavily than rewards received later.
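To make the five-tuple concrete, the sketch below encodes a small MDP as plain Python data structures. The two-state "chain" example and all of its names (states, actions, transition probabilities, rewards) are hypothetical and chosen only for illustration, not taken from any particular source.

```python
# Hypothetical two-state MDP, sketched as plain data structures.
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] maps each successor state s' to its probability P(s'|s,a).
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}

gamma = 0.9  # discount factor in [0, 1)
```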
A policy π specifies the action to take in each state. It can be deterministic (π(s) ∈ A) or stochastic (π(a|s), a probability distribution over actions for each state).
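Both forms of policy can be written down directly for the hypothetical MDP above; the particular action choices and probabilities here are arbitrary illustrative values.

```python
# Deterministic policy: a map from state to action.
pi_det = {"s0": "move", "s1": "stay"}

# Stochastic policy: a map from state to a distribution over actions.
pi_stoch = {
    "s0": {"stay": 0.3, "move": 0.7},
    "s1": {"stay": 1.0, "move": 0.0},
}
```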
Algorithms for solving MDPs include dynamic programming methods such as value iteration and policy iteration, which apply when the transition probabilities and rewards are fully known.
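The following is a minimal sketch of value iteration for the hypothetical MDP defined earlier: it repeatedly applies the Bellman optimality backup until the values stop changing, then reads off a greedy policy. The tolerance and iteration cap are arbitrary illustrative choices, not prescribed values.

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-8, max_iters=10_000):
    # Start with all state values at zero.
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        delta = 0.0
        for s in states:
            # Bellman optimality backup:
            # V(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) * V(s') ]
            best = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop once the largest update is below the tolerance
            break
    # Extract a greedy (deterministic) policy from the converged values.
    pi = {
        s: max(
            actions,
            key=lambda a: R[(s, a)]
            + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()),
        )
        for s in states
    }
    return V, pi


V, pi = value_iteration(states, actions, P, R, gamma)
print(V, pi)
```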